Abstract


Measurement of human hand and body motion is an important task for
applications ranging from athletic performance analysis to advanced
user-interfaces. Commercial human motion sensors are invasive, requiring the
user to wear gloves or targets. This thesis addresses noninvasive real-time
3D tracking of human motion using sequences of ordinary video images. In
contrast to other sensors, video cameras are passive and unobtrusive, and
can easily be added to existing work environments. Other computer vision
systems have demonstrated real-time tracking of a single rigid object
in six degrees of freedom (DOFs). Articulated objects like the hand present
three challenges to existing rigid-body tracking algorithms: a large number
of DOFs (27 for the hand), nonlinear kinematic constraints, and complex
self-occlusion effects. This thesis presents a novel tracking framework for
articulated objects that uses explicit kinematic models to overcome these
obstacles.

Kinematic models play two main roles in this work: they provide geometric
constraints on image features and predict self-occlusions. A kinematic
model for hand tracking gives the 3D positions of the fingers as a function
of the hand state, which consists of the pose of the palm and the finger joint
angles. Image features for the hand consist of lines and points which are
obtained by projecting finger phalanges and tips into the image plane. The
kinematic model provides a geometric constraint on the image plane
positions of hand features as a function of the hand state. Tracking proceeds by
registering the projection of the hand model with measured image features
at a high frame rate.

Self-occlusions are modeled by arranging the image features in overlapping
layers, ordered by their visibility to the camera. The layered
representation is generated automatically by the kinematic model and used to
constrain registration. This framework was implemented in a hand tracking
system called DigitEyes and tested in two sets of experiments. First, a hand
was tracked in real-time using two cameras and a 27 DOF model, and using
a single camera in a 3D mouse user-interface trial. Second, the occlusion
handling framework was tested off-line on a motion sequence with significant
self-occlusion. These results illustrate the effectiveness of explicit kinematic
models in 3D tracking and analysis of self-occluding motion.
Chapter 1
Introduction 


Tracking the motion of hands and bodies in three dimensions (3D) is an
important task for applications in computer graphics, athletic performance
analysis, and user-interfaces. Commercial human motion sensors are invasive,
requiring the user to wear gloves or targets [74, 37]. For example, current
motion capture systems work by recording the 3D trajectories of magnetic
trackers or optical targets attached to the user’s hands and limbs. These
trajectories are used in computer graphics applications to imbue animated
characters with realistic motion [54]. In other examples, various glove-based
sensors for palm and finger motion have been used to interpret sign
language [17] and control 3D CAD models [9]. In all of these cases, the
usefulness and convenience of the sensor are limited by the need to wear clumsy,
bulky devices, often tethered to an external computer.

This thesis addresses the noninvasive real-time tracking of human motion
using sequences of ordinary video images. In contrast to other sensors, video
cameras are passive and unobtrusive, and can easily be added to existing work
environments. Other computer vision systems have demonstrated real-time
tracking of a single rigid object in six degrees of freedom (DOFs) [20, 35].
Articulated objects like human figures and hands present three difficulties
for these existing algorithms: the large number of DOFs required to describe
their motion, nonlinearities in the mapping from the DOFs to the image
motion, and the presence of complex occlusion effects, when one part of the
body blocks the camera’s view of another. This thesis explores the use of
explicit kinematic models in a local tracking approach to overcome these
difficulties. It describes a tracking framework for general articulated objects
and presents experimental results for 3D hand tracking from natural image
sequences.

1.1  Tracking  with  Kinematic  Models 

The kinematics of an articulated object provide the most fundamental
constraint on its motion. Chapter 2 presents a general model-based framework
for tracking with kinematic constraints; this section outlines its application
to hand tracking. In the case of the hand, motion of the fingers and palm
in 3D is constrained by the skeleton. The relationship between these
skeletal constraints and a hand image is illustrated in Fig. 1.1(a). The black
overlay shows the projection of a 3D kinematic hand model, illustrated in
Fig. 1.1(b), into the image plane. The finger phalanges (links) are drawn as a
set of black “T” shapes, connected together at the knuckles. Each phalange
is represented by a cylinder, and each T shows the radius and axis of the
cylinder’s projection into the image. When the model has been registered to
the image correctly, as in the figure, the projected cylinders are aligned with
the fingers.

Local tracking consists of a series of registration problems in which the
configuration of the 3D hand model is adjusted so that its projection is
aligned with the current image. At the start of tracking, the image and the
model are registered. For each subsequent image in the motion sequence,
small corrections are made to the state of the hand that minimize the
registration error. The state vector for the hand contains the pose of the palm
and the finger joint angles. The registration error is described by a residual
function, which is minimized by the state correction in each frame. This
thesis explores two types of residual functions: Sum of Squared Differences
(SSD) and geometric feature residuals. The SSD residual measures the
intensity differences between the image and a template model for each body in the
articulated object. A collection of templates can represent a wide variety of
link shapes. Furthermore, since templates explicitly describe the region each
link occupies in the image, they are useful in tracking self-occluding objects,
as Chpt. 3 describes.
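
As a rough illustration of the SSD idea only, the sketch below sums squared intensity differences between a template and an equally sized image window. The function name `ssd_residual`, the nested-list image layout, and the integer window offset are assumptions made for this example; it does not reproduce the articulated-object residual developed in Chpt. 2 and 3.

```python
def ssd_residual(image, template, row0, col0):
    """Sum of squared intensity differences between `template` and the
    equally sized window of `image` whose top-left pixel is (row0, col0)."""
    total = 0.0
    for r, template_row in enumerate(template):
        for c, t in enumerate(template_row):
            diff = image[row0 + r][col0 + c] - t
            total += diff * diff
    return total

# Toy 3x4 image containing a bright 2x2 patch (hypothetical intensities).
image = [[0, 0, 0, 0],
         [0, 9, 8, 0],
         [0, 7, 9, 0]]
template = [[9, 8],
            [7, 9]]

aligned = ssd_residual(image, template, 1, 1)     # template placed on the patch
misaligned = ssd_residual(image, template, 0, 0)  # shifted up and left by one pixel
```

The residual is zero when the template sits exactly on the patch and grows as the window drifts away from it, which is the property a registration step exploits.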


Figure  1.1:  (a)  Hand  image  with  projection  of  3D  kinematic  model  overlaid 
in  black  and  detected  line  and  point  features  shown  in  white,  and  (b)  3D 
view  of  the  hand  model  which  is  registered  to  the  image  in  (a). 

Images of hands and bodies can also be described by a collection of line
and point features, as the “image skeleton” shown in Fig. 1.1 illustrates.
In this example, pairs of lines and point features, drawn in white, mark
the edges of the finger phalanges and the finger tip centers. The geometric
feature residual used in this case measures the distance between the
projected 3D model (the black overlay) and the measured line and point features
(the white overlay). This feature residual approximates the SSD residual for
roughly cylindrical objects like finger phalanges and limbs. A simple,
efficient algorithm for detecting the geometric features is described in Chpt. 4.
It forms the basis for the real-time tracking experiments described there.

The residual error for each image is minimized using a gradient-based
approach. The kinematic Jacobian for the articulated object is a key component
of the residual gradient. It plays a role in articulated object tracking that is
similar to its use in robot control. This duality is exploited in Sec. 2.5.4 in
the study of kinematic singularities, which arise when certain states have no
instantaneous effect on the image features. The geometric feature residual
can be used to identify these singular cases, because it provides a closed-form
expression for registration error as a function of the state. A standard
technique for stabilizing rigid body trackers is shown to be effective in dealing
with these singularities.
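
To make the role of the kinematic Jacobian concrete, here is a minimal sketch of gradient-based registration for a planar two-link chain whose tip position is the only image feature. Everything in it is an assumption for illustration: the link lengths, the function names `tip`, `jacobian` and `register`, and the small damping term. Damping is one common stabilizer, not necessarily the technique referred to above.

```python
import math

L1, L2 = 45.0, 26.0  # illustrative link lengths in mm (assumed)

def tip(q):
    """Image-plane position of the tip of a planar two-link chain."""
    a, b = q
    return (L1 * math.cos(a) + L2 * math.cos(a + b),
            L1 * math.sin(a) + L2 * math.sin(a + b))

def jacobian(q):
    """Kinematic Jacobian d(tip)/dq as a 2x2 nested list."""
    a, b = q
    return [[-L1 * math.sin(a) - L2 * math.sin(a + b), -L2 * math.sin(a + b)],
            [L1 * math.cos(a) + L2 * math.cos(a + b), L2 * math.cos(a + b)]]

def register(q0, measured, iters=20, damping=1e-3):
    """Damped Gauss-Newton: solve (J^T J + damping*I) dq = -J^T r each step."""
    q = list(q0)
    for _ in range(iters):
        x, y = tip(q)
        r = (x - measured[0], y - measured[1])  # registration residual
        J = jacobian(q)
        # Entries of the damped normal matrix A = J^T J + damping*I.
        a11 = J[0][0] ** 2 + J[1][0] ** 2 + damping
        a12 = J[0][0] * J[0][1] + J[1][0] * J[1][1]
        a22 = J[0][1] ** 2 + J[1][1] ** 2 + damping
        # Gradient direction g = J^T r.
        g1 = J[0][0] * r[0] + J[1][0] * r[1]
        g2 = J[0][1] * r[0] + J[1][1] * r[1]
        det = a11 * a22 - a12 * a12
        q[0] += (-a22 * g1 + a12 * g2) / det
        q[1] += (a12 * g1 - a11 * g2) / det
    return q

q_true = (0.4, 0.6)
measured = tip(q_true)                     # synthetic, noise-free measurement
q_est = register((0.3, 0.5), measured)     # start from a nearby state
```

For this toy chain the Jacobian determinant is L1 * L2 * sin(b), so it loses rank when the chain is fully extended (b = 0); the damping term keeps the normal matrix invertible in that configuration.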


1.2 Tracking Self-Occluding Objects with Layered Templates

When the motion of an object like the hand is sampled at a high frame rate,
the occlusion relations between its bodies rarely change. When they
do, the change can be predicted from the kinematic model. This
observation is exploited in Chpt. 3 to remove the estimation of occlusion from the
tracking problem, leaving only the registration of partly occluded templates.
The result is a layered representation of self-occlusion that is dynamically
updated by the kinematic model. A set of rules for hand template ordering
is developed through an analysis of planar kinematic chains.

The  registration  framework  from  Chpt.  2  is  extended  to  the  overlapping 
template  case  through  the  introduction  of  window  functions  that  mask  off 
the  contributions  of  occluded  templates.  The  presence  of  window  functions 
complicates  the  derivation  of  the  residual  Jacobian.  However,  the  structure 
of  the  layered  templates  can  be  expressed  in  a  window  tree,  and  analyzed  to 
Figure 1.2: Two finger self-occlusion experiment from Chpt. 4. (a) Hand

Figure 1.3: A sample graphical environment for a 3D mouse. The 3D cursor
is at the tip of the “mouse pole”, which sits atop the ground plane.

plication,  to  test  its  practical  usefulness  as  an  input  device.  The  resulting 
non-invasive  interface  gives  the  user  control  over  a  3D  cursor  in  a  graphical 
environment,  using  images  from  a  single  calibrated  camera.  Figure  1.3  shows 
sample  output  from  the  interface. 

In the final experiment, described in Sec. 4.4, an off-line version of the
DigitEyes system was used to test the self-occlusion framework of Chpt. 3. A
75 frame image sequence of two fingers undergoing significant self-occlusion
was successfully tracked.
1.4 Contributions

This  dissertation  makes  five  main  contributions: 

1.  Analysis  of  the  application  of  kinematic  models  to  visual  tracking  of 
articulated  objects,  addressing  Jacobian  singularities  and  sensitivity,  as 
well  as  techniques  for  efficient  Jacobian  computation. 

2. The first experimental demonstration of real-time tracking (at speeds of
up to 10 Hz) of a high-DOF articulated object (a 27 DOF hand model),
using both monocular and stereo image sequences of unadorned,
unmarked hands [46, 48].

3. Application of the DigitEyes sensor to the 3D mouse user-interface
problem, demonstrating the feasibility of 3D human sensing at
reasonable accuracy levels using currently available hardware [47].

4. The identification of a local ordering invariant for self-occluding objects,
an analysis of its existence conditions, and the design of a tracking
algorithm for self-occluding motion [49].

5. The first experimental demonstration of nontrivial 3D articulated
object tracking in the presence of self-occlusion [50].

These  results  extend  previous  techniques  in  computer  vision  for  rigid  body 
tracking  and  demonstrate  the  feasibility  of  vision-based  3D  human  motion 
sensing. 
Chapter 2

Tracking with Kinematic Models


The motion of an articulated object like the hand is determined by its
skeleton. A camera can only observe the skeleton indirectly, however, through its
effect on the skin. Skin and clothing deform during hand and body motion,
producing nonrigid effects in an image sequence. The magnitude of these
nonrigid components is small, however, compared to the effects of rigid,
articulated body motion. This dissertation treats nonrigidity as unmodeled
noise in the measurements of rigid, articulated objects. Experimental hand
tracking results, presented in Chpt. 4, demonstrate the efficacy of this
assumption. They are corroborated by experimental results for body
tracking [23, 29], which make a similar assumption.

2.1  The  Role  of  Kinematics  in  Visual  Tracking 

The use of kinematic models is vital for 3D tracking. As an example, consider
the problem of estimating the pose of the first finger in the image of Fig. 1.2.
The true finger pose and its projection into the image are shown with a
line drawing in Fig. 2.1 (a). The line drawing is a useful abstraction of the
geometric information contained in the image. For simplicity, assume that
the finger lies in a plane in space, and the camera model is orthographic.^
From the geometry of figure (a), it is clearly impossible to determine the 3D
pose of links ab, bc, and cd from the image points {a', b', c', d'} without a
kinematic model. In fact, for any sample plane in 3D there exists a finger
configuration that produces the given image. Fig. 2.1 (b) gives one example.
A unique solution is possible only when the link lengths are known. Only in
this case is the orientation of a link along the camera axis determined by its
projection in the image.

Figure 2.1: (a) Stick drawing of image features and model for the first finger
in Fig. 1.2 and (b) two models with different kinematics that produce the
same image.
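
The role of the link length in this argument can be shown with a small calculation. Under orthographic projection with unit scale (an assumption of this sketch), a link of known length L whose image has length l must extend sqrt(L^2 - l^2) along the camera axis. The function name `depth_extent` is hypothetical, and the sketch recovers only the magnitude of that extent, not its sign.

```python
import math

def depth_extent(link_length, image_start, image_end):
    """Magnitude of a link's extent along the camera axis under orthographic
    projection, from its known 3D length and its projected endpoints."""
    projected = math.hypot(image_end[0] - image_start[0],
                           image_end[1] - image_start[1])
    if projected > link_length:
        raise ValueError("projected length exceeds the known link length")
    return math.sqrt(link_length ** 2 - projected ** 2)

# A link of length 5 that projects to a segment of length 3 (toy numbers).
dz = depth_extent(5.0, (0.0, 0.0), (3.0, 0.0))
tilt = math.atan2(dz, 3.0)  # angle between the link and the image plane
```

Without `link_length`, any value of `dz` would be consistent with the same projected segment, which is the ambiguity described above.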

The example in Fig. 2.1 also illustrates the difference between errors in
registration and errors in 3D pose (state) estimates. Registration refers to
the alignment between an image and the image plane projection of a 3D
model. As Fig. 2.1 (b) illustrates, it is easy to achieve zero registration error
without a kinematic model for any sample plane position, by aligning the
projections of {a, b, c, d} with {a', b', c', d'}. The corresponding pose error
can be arbitrarily large, however, as the sample plane rotates away from the
true finger plane. Now suppose that a kinematic model is available, as in
Fig. 2.1 (a), but that the model itself has some error. When the model errors
are small, the pose error will also be small. The registration error will be
nonzero in this case, as no configuration of the incorrect model will match
the image exactly. A kinematic model makes it possible to extrapolate image
registration into three dimensions. The quality of this extrapolation depends
on the accuracy of the model.

^In orthographic projection, all rays from the scene to the camera are parallel.

There are two other sources of 3D pose information besides a kinematic
model: shading and stereo. The shading in an image of the hand varies with
its spatial orientation. These intensity changes carry information about the
3D pose of the palm and fingers. Shading cues are an important component
of human perception, but exploiting them in a vision algorithm is known to
be extremely challenging. In hand images, shadows and lighting variations
make it difficult to interpret intensity changes correctly. As a result, it is
unlikely that the accuracy of pose estimation due to shading alone would
exceed that available from the kinematics.

Stereo is the second alternative approach to pose estimation, for links
that are visible in two or more camera images. In stereo, triangulation with
corresponding pairs of image points, such as {a1, a2} in Fig. 2.2, produces 3D
estimates of {a, b, c, d}. Stereo is inadequate by itself, however, when a link is
not visible in both views due to occlusion, a common occurrence in practice.
But suppose that a kinematic model is available in addition to stereo. In
this case, localizing three of the points by stereo determines the plane of
the finger, and the position of the fourth point can be determined from a
single view. This illustrates another key feature of the kinematic model: it
captures redundancy in the measurements, which leads to an overdetermined
estimation problem.

Kinematic models play three main roles in tracking. First, they
parameterize the DOFs on the object, and provide a mathematical representation
for the output of the tracking algorithm: a trajectory in state space.
Second, they express constraints on the motion of the rigid bodies making up
the articulated object. These constraints lead to an over-determined
estimation problem in the image measurements, which is desirable for good noise
properties. Third, the kinematics also constrain the possible occlusions
between the rigid bodies. Kinematic analysis plays an important role in the
development of tracking algorithms for self-occluding motion in Chpt. 3.

Figure 2.2: Use of an additional stereo image to reconstruct the 3D pose of
the finger depicted in Fig. 2.1 (a).

This chapter begins with a brief description of the mathematical
foundations of kinematic modeling. These representations originated in the robot
manipulation literature, but have been adapted slightly to meet the
requirements of visual tracking. This presentation is significantly more complete
than any that has appeared in the visual tracking literature to date. The
application of kinematic modeling techniques is illustrated for the hand. The
resulting kinematic hand model is employed throughout this thesis.
Calibration of kinematic and camera models is described, along with the effect of
their errors.

The second half of the chapter describes the incorporation of kinematic
models into tracking algorithms. The kinematics provide a forward model
for the object, generating predicted images as a function of the estimated
state. This chapter addresses the geometric component of the forward model,
and ignores the effects of occlusion. In this chapter, every rigid body in the
model is assumed to be completely visible to the camera. The forward model
interacts with the input image through a residual error measure. Minimizing
the residual through gradient-based algorithms brings the projection of the
model into alignment with the input images.

The image intensities generated by the object determine the
measurements that are available for tracking, and therefore the form of the
residual error. Two residual errors are examined here. The first is a general
template-based residual that can be applied to arbitrary articulated objects.
The second residual is derived from geometric line and point features that
approximate the template residual in the case of objects, like hands and bodies,
made up of cylindrical links. The feature residual is a closed form
expression that is amenable to analysis and real-time implementation on modest
computing hardware.

2.2  Kinematic  Modeling  of  Articulated  Objects 

I employ standard kinematic modeling techniques from robotics [59] to
represent skeletal constraints for tracking. These models have been used for
decades to solve robot control and path planning problems. They have
good theoretical properties and support efficient on-line algorithms.
Denavit-Hartenberg notation, for example, provides a standard description for
kinematic chains like the finger. This notation has already been employed in
hand models for computer graphics [52], but has not been used explicitly
in hand or body tracking to date. One of the goals of this thesis is to
explore the connections between articulated tracking and robot control more
carefully than previous authors. For example, historical robot control issues
like kinematic singularities have close parallels in hand tracking, as I will
describe later in Sec. 2.5.3. Developing these parallels makes techniques from
the robotics literature available for articulated tracking analysis.

All previous work on 3D human tracking employed some form of
kinematic model. The two earliest systems, by O’Rourke and Badler [42] and
Hogg [23], predated the widespread popularization of robot kinematic
models by Paul [43]. They employed their own customized kinematic
representations. The use of robot kinematic models for human body tracking was first
proposed by Yamamoto and Koshikawa in [72]. This work did not present
a detailed modeling framework, however, but relied on a separate software
package for kinematic computations.

The kinematic models described in this section form the basis for all of
the tracking algorithms in this thesis. Mathematical representations of object
kinematics are presented here in detail. Following this description, a
kinematic hand model is derived from an anatomical study. This illustrates both
the usefulness of the modeling framework and the specific concerns of
kinematic modeling for visual tracking. Models must be calibrated before they
can be used, and the calibration process, along with the effects of calibration
errors, is described at the end of the section.


2.2.1  Coordinate  Frames  and  Transformations 

An articulated object is made up of rigid bodies, called links, connected
by joints. Each link has its own coordinate frame in the kinematic model,
and pairs of link frames are connected by coordinate transformations. A
coordinate transform from frame i to frame j, written $T_i^j$, is specified by a
rotation matrix $R_i^j$ and translation vector $d_i^j$, arranged in a 4x4 homogeneous
transformation matrix:

$$T_i^j = \begin{bmatrix} R_i^j & d_i^j \\ 0^T & 1 \end{bmatrix} . \qquad (2.1)$$

The transform satisfies the relation $p_i = T_i^j p_j$, where $p_i$ and $p_j$ denote the
coordinates with respect to frames i and j of the world point p. Each $p_i$ is a
4x1 vector with components $[x_i \; y_i \; z_i \; 1]^T$. The 3D configuration of an object like
the hand is determined by the position of each of its links with respect to a
world coordinate frame. One or more cameras, also positioned with respect
to the world frame, provide images from which the hand configuration is

Figure 2.3: Illustration of the basic coordinate frames in the kinematic model.

in the model. It is convenient to add an additional shape coordinate frame
to each link, which positions the visible geometry relative to the link frame.
Having an additional frame is useful, as the coordinate frame choice which is
best for the kinematic description may not be the best for shape modeling.
The choice of shape frame, like the choice of link coordinates, depends on
the application. The specific choices made in hand modeling are described
in the next section.

A series of links connected by joints forms a kinematic chain. The
position of any link in the chain can be obtained by multiplying transformation
matrices. For example, the position of the link 3 frame in Fig. 2.3 with
respect to the camera is given by $T_c^3 = T_c^w T_w^1 T_1^2 T_2^3$. Joints
are modeled by parameterized coordinate transformations, $T_i^j(v)$, called joint
transforms. A joint transform has the form of Eqn. 2.1, but is a matrix
function of a vector v of kinematic parameters, such as joint angles and link
lengths.
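
Chaining transforms by matrix multiplication can be sketched in a few lines. The helper names (`rot_z`, `trans`, `matmul`, `chain`, `apply`), the frame names, and the numeric values below are all assumptions chosen for illustration; they are not taken from the DigitEyes implementation.

```python
import math

def rot_z(theta):
    """Homogeneous transform for a rotation of `theta` radians about z."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def trans(x, y, z):
    """Homogeneous transform for a pure translation by (x, y, z)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def chain(*transforms):
    """Multiply transforms left to right, as along a kinematic chain."""
    result = trans(0.0, 0.0, 0.0)  # identity
    for T in transforms:
        result = matmul(result, T)
    return result

def apply(T, point):
    """Map homogeneous point coordinates [x, y, z, 1] through T."""
    return [sum(T[i][k] * point[k] for k in range(4)) for i in range(4)]

# Hypothetical chain: camera <- world <- link 1 <- link 2.
T_cam_world = trans(0.0, 0.0, 500.0)  # world origin 500 mm along the camera z axis
T_world_1 = rot_z(math.pi / 2)        # link 1 rotated 90 degrees about z
T_1_2 = trans(45.0, 0.0, 0.0)         # link 2 origin 45 mm along link 1's x axis
T_cam_2 = chain(T_cam_world, T_world_1, T_1_2)

# Coordinates of the link 2 origin expressed in the camera frame.
origin_of_link_2 = apply(T_cam_2, [0.0, 0.0, 0.0, 1.0])
```

Because each factor maps coordinates one frame closer to the camera, the product maps link 2 coordinates directly into camera coordinates.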

Link frames and joint transforms make up the topological part of the
kinematic model: they specify the number of rigid bodies and their
interconnections. The topological part of a human kinematic model comes directly
from basic anatomy. A finger, for example, consists of three phalanges (rigid
links) connected in series by the two knuckle joints. Kinematic parameters
for the joint transforms make up the parametric part of the kinematic model.
They consist of the object’s DOFs and any fixed model parameters.

The two types of joint transforms used in this thesis are spatial transforms
and Denavit-Hartenberg transforms. Spatial transforms model the six DOFs
between two link frames that are not in physical contact. A spatial transform
is used in the hand model to position the palm relative to the world frame. I use
quaternions to represent the rotational part of the spatial transform. Quaternions
encode the axis-angle representation of rotation with four parameters.^
Three-parameter representations, like Euler angles, have singularities at which their
Jacobian loses rank, making tracking more difficult. These singularities are
not a natural result of the kinematics, but an artifact of the
parameterization. Since an object like the hand may achieve an arbitrary pose with
respect to a given camera, it is difficult to ensure that singular configurations
are avoided. Quaternions are the minimal singularity-free representation of
the rotation group [60]. They have a long history of use in satellite
control [69], and more recently in vision [21] and computer graphics [58]. The
resulting spatial transform has seven parameters.

^See [38] for general information about quaternions and spatial transforms.

Since the four quaternion variables are not a minimal description of
rotation, they are subject to a unit norm constraint that reduces their DOFs
to three. Specifically, a quaternion vector Q must satisfy $Q^T Q = 1$ at all
times. As a result, quaternion-based tracking is technically a constrained
estimation problem. I follow the practice described in [22] of expressing the
quaternion rotation matrix in a form that includes the normalization. The
resulting quaternion estimate is re-normalized periodically to prevent the
accumulation of numerical errors.
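
One common way to build the normalization into the rotation matrix is to divide every entry by the squared norm of the quaternion, as sketched below. The component ordering (w, x, y, z) and the function names are assumptions of this example, and the exact form used in [22] may differ.

```python
import math

def quat_to_rotation(q):
    """Rotation matrix for quaternion q = (w, x, y, z). Each entry is divided
    by the squared norm, so q need not be exactly unit length."""
    w, x, y, z = q
    n = w * w + x * x + y * y + z * z
    if n == 0.0:
        raise ValueError("the zero quaternion does not represent a rotation")
    return [
        [(w*w + x*x - y*y - z*z) / n, 2 * (x*y - w*z) / n, 2 * (x*z + w*y) / n],
        [2 * (x*y + w*z) / n, (w*w - x*x + y*y - z*z) / n, 2 * (y*z - w*x) / n],
        [2 * (x*z - w*y) / n, 2 * (y*z + w*x) / n, (w*w - x*x - y*y + z*z) / n],
    ]

def renormalize(q):
    """Rescale q to unit length, as done periodically during tracking."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

# A deliberately non-unit quaternion for a 90 degree rotation about z.
q = (2.0 * math.cos(math.pi / 4), 0.0, 0.0, 2.0 * math.sin(math.pi / 4))
R = quat_to_rotation(q)
q_unit = renormalize(q)
```

With this form, a small drift in the norm of the estimate changes neither the rotation matrix nor its orthogonality, so re-normalization is needed only to keep the numbers well scaled.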

When two links are physically connected by a joint, the coordinate
transformation between them must have fewer than six DOFs. The
Denavit-Hartenberg (DH) notation [13] provides a consistent parameterization in this
case. Each DH transform is composed of four basic transformations:

$$T_{i-1}^{i} = \mathrm{Rot}_z(\theta_i)\,\mathrm{Trans}_z(d_i)\,\mathrm{Trans}_x(a_i)\,\mathrm{Rot}_x(\alpha_i) , \qquad (2.2)$$

where Rot(·) represents a rotation about a given axis, and Trans(·) a
translation along it. See [59], Fig. 3-4, for an illustration of the general DH
transform, which is widely used in robotics. The parameters $\theta_i$, $d_i$, $a_i$, and $\alpha_i$,
along with the choice of the link frame, can be used to model all lower pair
joints of interest. The DH parameters can be divided into two groups: state
variables, which represent the DOFs of the object at the joint, and fixed
parameters, which describe the object’s geometry and are unchanged by its
motion.

Figure 2.4: Hand skeleton and joints. This is Fig. 1 from [61], used with
permission. [Labels recovered from the figure: Metacarpals; Trapeziometacarpal
Joint, 3 DOF; Metacarpocarpal Joints, 1 DOF each on digits 4 & 5; Carpals;
Radius; Ulna.]
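
The DH transform can be written in closed form by multiplying out its four factors. The sketch below assumes the usual axis assignment from the robotics literature (rotation by theta and translation by d along z, translation by a and rotation by alpha along x); the function name `dh_transform` and the sample parameter values are illustrative.

```python
import math

def dh_transform(theta, d, a, alpha):
    """Closed form of Rot_z(theta) Trans_z(d) Trans_x(a) Rot_x(alpha)
    as a 4x4 homogeneous matrix stored as nested lists."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [[ct, -st * ca, st * sa, a * ct],
            [st, ct * ca, -ct * sa, a * st],
            [0.0, sa, ca, d],
            [0.0, 0.0, 0.0, 1.0]]

# Revolute special case (d = alpha = 0): a 45 mm link rotated by 30 degrees.
T_revolute = dh_transform(math.pi / 6, 0.0, 45.0, 0.0)

# A fixed offset with no state variable: slide -31 mm along z, twist 90 degrees.
T_offset = dh_transform(0.0, -31.0, 0.0, math.pi / 2)
```

In the revolute case only `theta` would vary during tracking, so it plays the role of a state variable, while `a` is a fixed parameter describing the link geometry.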

The  kinematic  representation  described  above  can  be  applied  to  a  wide 
variety  of  objects,  from  humans  to  industrial  robots.  In  the  next  section,  it 
is  used  to  develop  a  hand  kinematic  model,  which  is  employed  in  all  of  the 
tracking  experiments  in  this  thesis. 


2.2.2  A  Kinematic  Hand  Model 

Kinematic models for visual tracking need only describe motion which a
camera can measure. As a result, they can be considerably simpler than those

Finger chains are built up from revolute joints, which constrain two links
to a single rotational DOF around the joint axis. Figures 2.3 and 2.5 illustrate
the link frame assignments for the revolute joint model. The frame for link i
is chosen so that $\theta_i$ (in Eqn. 2.2) is the revolute joint angle, and the negative
X axis passes through the joint center of link i − 1. With this choice of
coordinates, the DH kinematic parameters $d_i$ and $\alpha_i$ are zero, and $a_i$ equals
the link length. Making these substitutions in Eqn. 2.2 gives the revolute
joint transform

$$T_{i-1}^{i} = \mathrm{Rot}_z(\theta_i)\,\mathrm{Trans}_x(L_i) , \qquad (2.3)$$

where $L_i$ is the length of the ith link. The link lengths are the fixed
parameters in the kinematic model. They are determined before tracking begins
through a calibration process described in Sec. 2.2.3. Once they have been
specified, the state variables $\theta_i$ completely determine the configuration of the
finger chains. Each finger contributes four joint variables to the state vector.
The arrows in Fig. 2.5 illustrate the axes of the revolute joints of the fingers
and thumb. The two DOFs at each finger MCP joint are modeled by a pair
of revolute joint transforms, each with a single DOF. Arbitrary compound
joints can be described in this manner. The shape frame for finger links is
positioned at the joint center, immediately following the link rotation. Thus
the transform between link and shape frames is given by

$$ T_i^{s} = \mathrm{Trans}_x(-L_i) . \tag{2.4} $$
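As a rough illustration of how revolute joint transforms of the form of Eqn. 2.3 compose into a finger chain, the sketch below multiplies one transform per link and reads off the frame origins. It is a minimal sketch and not the DigitEyes implementation: the helper names (`rot_z`, `trans_x`, `finger_frames`) are introduced here for illustration, the base frame is assumed to be the identity, and the link lengths of 45, 26, and 24 mm are borrowed from Table 2.1.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rot_z(theta):
    """Rotation by theta about the z axis (the revolute joint axis)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

def trans_x(length):
    """Translation by length along the x axis (the link direction)."""
    return [[1.0, 0.0, 0.0, length], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

def revolute(theta, length):
    """Revolute joint transform: rotate about z, then move along the link."""
    return mat_mul(rot_z(theta), trans_x(length))

def finger_frames(thetas, lengths):
    """Link frames of a finger chain, expressed in the (assumed) base frame."""
    frame = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    frames = []
    for theta, length in zip(thetas, lengths):
        frame = mat_mul(frame, revolute(theta, length))
        frames.append(frame)
    return frames

def origin(frame):
    """Position of a frame's origin."""
    return (frame[0][3], frame[1][3], frame[2][3])

LINK_LENGTHS = [45.0, 26.0, 24.0]  # mm, taken from Table 2.1
straight = finger_frames([0.0, 0.0, 0.0], LINK_LENGTHS)
bent = finger_frames([math.pi / 2, 0.0, 0.0], LINK_LENGTHS)
# Shape frame of the first link: step back along the link to its joint center.
shape_frame_0 = mat_mul(bent[0], trans_x(-LINK_LENGTHS[0]))
```

With all joint angles at zero the last frame sits 95 mm along the x axis, and rotating only the first joint by 90 degrees swings the whole chain onto the y axis. Translating a link frame back by its link length recovers the joint center, which is where the text places the shape frame.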

Table 2.1 presents the kinematic model of the palm and first finger in
its full detail. This is an excerpt from the table in Appendix A containing
the complete hand kinematics. The table is a formatted version of a file the
DigitEyes tracking system reads in when building its kinematic model. Each
frame is numbered, and its entry in the column titled Next is a pointer to
the frame that follows it in the chain. These pointers specify the topology
of the kinematic model. Joint transforms are automatically created for links
Frame  Geometry         θ     d      a     α      Shape (in mm)     Next
0      Palm             0.0   0.0    0.0   0.0    x 56, y 86, z 15  1 8 ...
1                       π/2   0.0    38.0  -π/2                     2
2                       0.0   -31.0  0.0   π/2                      3
3                       q7    0.0    0.0   π/2                      4
4      Finger 1 Link 0  q8    0.0    45.0  0.0    Rad 10.0          5
5      Finger 1 Link 1  q9    0.0    26.0  0.0    Rad 10.0          6
6      Finger 1 Link 2  q10   0.0    24.0  0.0    Rad 9.0           7
7      Finger 1 Tip     0.0   0.0    0.0   0.0    Rad 9.0           Nil

Table 2.1: Kinematic and shape parameters for the palm and first finger.
State variables are denoted q_i, where q0-q6 are the palm pose and q7-q10 are
the rotation angles for the first finger.


that are connected in the table. Links with an entry in the Geometry column
have features that can be tracked. Frames with a state variable in the θ
column have a revolute joint, while other frames have constant transforms.
For example, two constant DH transforms, in frames 1 and 2, are used to
position the base of the first finger (at frame 3) with respect to the palm.
Note that the d and α parameters are nonzero only for constant transforms,
in keeping with Eqn. 2.3. The Shape column describes the visible geometry
of the link, which is used to render a model hand for visualization purposes.
The other three fingers are similar to the first, and are contained in the
Appendix.
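The way the Next pointers define the model topology can be sketched with a small data structure. The dictionary layout and the function names below are assumptions made for illustration; they are not the DigitEyes file format. Only frames 1 to 7 of Table 2.1 are included, because the Next entry of frame 0 is elided in the excerpt.

```python
import math

HALF_PI = math.pi / 2

# First-finger rows of Table 2.1. theta holds either a constant angle or
# the name of a state variable (hypothetical encoding, chosen for this sketch).
FRAMES = {
    1: {"geometry": None, "theta": HALF_PI, "d": 0.0, "a": 38.0,
        "alpha": -HALF_PI, "next": 2},
    2: {"geometry": None, "theta": 0.0, "d": -31.0, "a": 0.0,
        "alpha": HALF_PI, "next": 3},
    3: {"geometry": None, "theta": "q7", "d": 0.0, "a": 0.0,
        "alpha": HALF_PI, "next": 4},
    4: {"geometry": "Finger 1 Link 0", "theta": "q8", "d": 0.0, "a": 45.0,
        "alpha": 0.0, "next": 5},
    5: {"geometry": "Finger 1 Link 1", "theta": "q9", "d": 0.0, "a": 26.0,
        "alpha": 0.0, "next": 6},
    6: {"geometry": "Finger 1 Link 2", "theta": "q10", "d": 0.0, "a": 24.0,
        "alpha": 0.0, "next": 7},
    7: {"geometry": "Finger 1 Tip", "theta": 0.0, "d": 0.0, "a": 0.0,
        "alpha": 0.0, "next": None},  # Nil terminates the chain
}

def walk_chain(frames, start):
    """Follow Next pointers from a starting frame until Nil."""
    chain = []
    frame_id = start
    while frame_id is not None:
        chain.append(frame_id)
        frame_id = frames[frame_id]["next"]
    return chain

def state_variables(frames, chain):
    """Frames whose theta entry names a state variable have a revolute joint."""
    return [frames[i]["theta"] for i in chain
            if isinstance(frames[i]["theta"], str)]

def tracked_links(frames, chain):
    """Frames with a Geometry entry carry features that can be tracked."""
    return [frames[i]["geometry"] for i in chain
            if frames[i]["geometry"] is not None]

chain = walk_chain(FRAMES, 1)
joints = state_variables(FRAMES, chain)
links = tracked_links(FRAMES, chain)
```

Walking from frame 1 visits the two constant transforms, then the four revolute joints q7 to q10, and ends at the fingertip frame. Swapping in a different table would describe a different articulated object without changing the traversal code.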

Like the fingers, the thumb model is also constructed from the revolute
joints of Eqn. 2.3. The thumb is the most difficult digit to model, due to
its great dexterity and intricate actuation. It has five DOFs (see Fig. 2.4),
but one DOF at the trapeziometacarpal joint is dependent on the others.
It acts to rotate the thumb longitudinally, bringing it into opposition with
the fingers during grasping. This effect is modeled by placing an additional
revolute DOF at the thumb MP joint, as shown in Fig. 2.5. Placing the
oppositional DOF there, rather than at the base, helps limit its impact on
the model. This choice was motivated by the experience of Rijpkema and
Girard in their grasp modeling system [52]. They employed a similar thumb
model and obtained realistic computer graphic animations of hand grasps.
Aside from this extra joint, the thumb model is quite similar to that of the
fingers, with two DOFs at the trapeziometacarpal joint and one each at the
thumb MP and IP joints. The thumb occupies frames 29 through 36 in the
kinematic table of Appendix A.

Real  hands  deviate  from  the  above  modeling  assumptions  in  three  main 
ways.  First,  most  fingers  are  slightly  nonplanar.  This  deviation  could  be 
modeled  by  allowing  nonparallel  joint  axes,  but  the  planar  approximation 
has  proved  to  be  adequate  in  practice.  Second,  the  last  two  joints  of  the 
finger  (the  distal  and  proximal  interphalangeal  joints)  are  driven  by  the  same 
tendon  and  are  not  capable  of  independent  actuation.  It  is  simpler  to  include 
these  DOFs  separately,  however,  than  to  model  the  complicated  angular 
relationship  between  them.  The  third  deviation  stems  from  the  rigid  palm 
assumption,  which  ignores  the  metacarpocarpal  joints  at  the  base  of  fingers 
4  and  5  (see  Fig.  2.4).  When  gripping  an  object,  like  a  baseball,  these  joints 
permit  the  palm  to  conform  to  its  surface,  causing  the  anchor  points  to  move 
by  tens  of  millimeters.  For  free  motions  of  the  hand  in  space,  however,  this 
deviation  is  small  enough  to  ignore. 

The  full  hand  model  consists  of  16  rigid  bodies  and  a  28  dimensional  state 
vector.  The  kinematic  model  described  above  is  fairly  standard,  and  closely 
related  models  have  appeared  in  the  user-interface,  computer  graphics,  and 
biomechanics  literature  [61,  52,  66].  The  most  common  difference  between 
kinematic  hand  models  is  in  their  treatment  of  the  metacarpophalangeal  and 
trapeziometacarpal  joints.  This  dissertation  does  not  explore  these  subtleties 
of  hand  modeling  in  any  significant  detail.  Kinematic  modeling  issues  are 
secondary to the more basic concerns of real-time tracking and occlusion
handling, which are the focus of this research. Once a solid foundation for
visual articulated object tracking has been established, the development of
accurate kinematic models for specific applications can be explored in earnest.

Articulated  objects  like  the  hand  are  subject  to  other  motion  constraints 
besides  the  kinematic  joints  which  are  the  focus  of  this  chapter.  Regions 
of  the  state  space  may  be  inaccessible  to  the  model,  for  example,  due  to 
joint  limits  and  non-interpenetration.  This  leads  to  inequality  constraints 
on  the  state  estimates.  Moreover,  as  a  result  of  actuation  and  motor  control 
patterns,  groups  of  states  will  often  be  coupled  during  characteristic  motions. 
For example, the fingers will follow similar state trajectories in making a fist.
Since  these  constraints  act  on  the  state  space  at  a  level  above  the  basic 
kinematics,  they  were  not  addressed  in  this  work. 

Kinematic  models  for  the  entire  body  could  be  developed  using  the  meth¬ 
ods  described  in  this  section.  In  fact,  the  body’s  kinematics  are  topologically 
quite  similar  to  those  of  the  hand,  with  the  torso  playing  the  role  of  the  palm 
and the arms and legs taking on the role of the fingers. Like the fingers, the
kinematic chains of the arms and legs are predominantly planar. One point
of departure is the much greater flexibility of the torso compared to the hand
as  a  result  of  the  spinal  column. 

Adopting  kinematic  representations  from  robotics  makes  it  possible  to 
track  any  articulated  object  with  the  same  mathematical  framework.  This 
generality is reflected in the software implementation of the DigitEyes
tracking system. Any object that can be modeled using the techniques of this
chapter can be tracked simply by changing the file illustrated in Table 2.1.
This capability is exploited in Chpt. 4, where different subsets of the
whole-hand model are employed in separate experiments. To use a kinematic model
for tracking, its fixed parameters must be determined from the actual,
physical hand. This is accomplished in the kinematic calibration stage
described next.


2.2.3 Kinematic Model Calibration and Errors

Calibrating  the  kinematic  model  by  setting  its  fixed  parameters  is  the  most 
challenging  aspect  of  hand  modeling.  The  21  joint  axes,  15  link  lengths, 
and  5  anchor  points  of  the  hand  were  determined  in  a  three  stage  off-line 
calibration process. First, the joint axes were initialized following the finger
and thumb anatomy of the previous section. Next, the link lengths were
determined in two steps. In the first step, the distances between the three
knuckles in each finger were measured with a ruler at the surface of the skin,
to give a rough length for each link. Then, the resulting kinematic model was
fit to each finger separately in two images taken with a calibrated camera:
finger outstretched and finger curled. The link lengths were tuned manually
until the projected hand model matched the images. Obtaining link lengths
for the fingers and thumb took about four hours.

Finally,  the  anchor  points  were  determined  in  the  last  stage.  They  are  the 
most challenging parameters to calibrate, as they are difficult to measure on
real hands, and difficult to identify in images. The anchor point calibration
strategy exploited the known link lengths from the previous stage, and three
images of the back of the hand with fingers extended: one looking straight
down (called image 1) and two at oblique angles (images 2 and 3). The
first step was the arbitrary assignment of the palm origin to the MCP joint
center of the first finger. Measurements with a ruler gave rough estimates of
the anchor points with respect to this frame in the x and y axis directions
(parallel to the plane of the palm, with the y axis pointing down the first
finger).

Given these preliminary anchor points, an interactive version of the tracking
system was used to fit the complete hand model to image 1. After a few
iterations, the anchor points were “released,” freeing each finger and thumb
to move independently of the palm. This allowed the base of each digit to
shift until the error in its tip and edge positions was minimized. The original
anchor point and the current base of each finger and thumb were overlaid
on the hand image. Estimation was halted after a few iterations, and the
original anchor points were manually adjusted to agree with the new base
positions. This procedure was repeated with the two oblique images, to
localize the anchor points along the z axis (out of the plane of the palm). It
took a few hours to calibrate the anchor points. The calibration procedure
described above was performed once for my right hand, and the resulting
kinematic model was used in all subsequent experiments. It is presented in
Appendix A in its full detail.

The  calibration  goal  of  this  dissertation  was  to  obtain  a  useful  kinematic 
model  as  quickly  as  possible.  The  experimental  performance  of  this  model  on 
a  wide  variety  of  hand  images  indicates  that  this  goal  was  achieved.  However, 
calibration  is  likely  to  remain  a  nontrivial  component  of  any  future  model- 
based  articulated  object  tracking  system.  The  adequacy  of  the  hand  model 
calibration  is  discussed  further  in  Chpt.  4,  and  an  approach  to  automatic, 
on-line  calibration  is  discussed  in  Chpt.  6.  The  remainder  of  this  section 
presents  a  taxonomy  of  kinematic  model  errors,  and  describes  their  effect  on 
tracking  performance. 

Errors  can  occur  in  both  the  topological  and  parametric  parts  of  the 
kinematic  model.  Topological  errors,  like  incorrect  joint  axes,  are  the  result 
of anatomical deviations from the model. For example, if a finger exhibits
a large deviation from planarity, the joint axes of the planar finger model
will be incorrect. As a result, it will be impossible to set the state variables
so that the finger links are registered with the image. This type of error is
easily  detected  by  overlaying  the  model  projection  on  the  image. 

Improper calibration can also produce errors in the link lengths and anchor
points that make up the fixed model parameters. The effect of incorrect
link lengths is particularly striking. If the links are too long, the finger
develops obvious “kinks” in trying to fit its image. If the links are too short,
the model finger tip never reaches its match in the image. Errors in the
anchor points are the most difficult to detect and correct, as they may not be
apparent unless the model is fit to the image under a wide range of viewing
angles.

I  encountered  all  three  of  these  types  of  kinematic  errors  in  the  early 
stages  of  hand  modeling.  They  proved  to  be  fairly  easy  to  detect  using  an 
interactive tracking system. The system I developed made it possible to fit
hand  models  to  images,  see  the  result  in  3D  from  an  arbitrary  viewpoint,  and 
quickly  modify  the  joint  angles  to  observe  their  effect  on  registration.  The 
importance  of  having  an  interactive  system  when  developing  these  models 
cannot  be  over-emphasized.  With  this  tool,  the  space  of  possible  models 
could  be  searched  efficiently  and  problems  diagnosed  quickly.  The  interactive 
system  is  described  in  more  detail  in  Chpt.  4. 

A  calibrated  kinematic  model  can  be  viewed  as  a  mapping  from  the  state 
space  to  the  3D  positions  of  the  shape  frames,  which  contain  the  visible  sur¬ 
faces  of  the  links.  The  next  stage  in  this  mapping  is  the  projection  of  the  3D 
link  geometry  into  the  image  plane.  This  is  accomplished  through  a  camera 
model,  which  maps  points  from  the  shape  frames  into  image  coordinates. 


2.3  Camera  Modeling  and  Calibration 


As  with  the  kinematics,  cameras  can  also  be  modeled  by  transformations 
between coordinate frames. The imaging geometry of a pin-hole camera
is modeled by a projective transform between the camera and image buffer
coordinates [16]:

$$ P_c^b = \begin{bmatrix} \alpha_u & 0 & u_0 & 0 \\ 0 & \alpha_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} . \tag{2.5} $$

The intrinsic camera parameters, $\{\alpha_u, \alpha_v, u_0, v_0\}$, define the
scale factors and origin for the camera’s sensor array. The image coordinates
of a 3D point $p_c$ located in the camera frame are $w = [x_b/z_b,\ y_b/z_b]$,
where $p_b = [x_b, y_b, z_b] = P_c^b p_c$. Let $S[\cdot]$ denote the scaling
operator that returns the first two elements of a vector divided by its third.
Furthermore, let $T_w^c$ specify the camera position with respect to the world
frame (the extrinsic camera model). The projection of a world point $p_w$ into
the camera image can then be written

$$ w = S[P_c^b T_w^c p_w] = S[P p_w] , \tag{2.6} $$

where $P$ is the 3x4 camera projection matrix.

When  the  distances  between  points  on  an  object  of  interest  are  small 
compared  to  the  distance  to  the  camera,  the  perspective  projection  model 
can  be  approximated  by  orthographic  projection 

$$ w = P_c^o T_w^c p_w = P p_w , \tag{2.7} $$

where

$$ P_c^o = \begin{bmatrix} \alpha_u & 0 & 0 & u_0 \\ 0 & \alpha_v & 0 & v_0 \end{bmatrix} \tag{2.8} $$

is an orthographic transform, and $P$ is the 2x4 orthographic projection
matrix. The fact that the camera and kinematic transformations have a similar
algebraic form makes it easy to combine them in one representational
framework.
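The relationship between the perspective and orthographic camera models can be checked numerically. The sketch below is illustrative only: the intrinsic values are made up, the camera frame is assumed to coincide with the world frame (so the extrinsic transform is the identity), and the orthographic scale factors are assumed to absorb a reference depth `Z_REF`, which is one common way to make the two models agree for points near that depth.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def scale(v):
    """Scaling operator S[.]: first two elements divided by the third."""
    return (v[0] / v[2], v[1] / v[2])

def perspective_matrix(alpha_u, alpha_v, u0, v0):
    """3x4 pin-hole projection in the form of Eqn. 2.5."""
    return [[alpha_u, 0.0, u0, 0.0],
            [0.0, alpha_v, v0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]

def orthographic_matrix(alpha_u, alpha_v, u0, v0):
    """2x4 orthographic projection in the form of Eqn. 2.8."""
    return [[alpha_u, 0.0, 0.0, u0],
            [0.0, alpha_v, 0.0, v0]]

Z_REF = 2.0  # assumed reference depth of the object (hypothetical value)
P_persp = perspective_matrix(500.0, 500.0, 320.0, 240.0)
P_orth = orthographic_matrix(500.0 / Z_REF, 500.0 / Z_REF, 320.0, 240.0)

# Homogeneous points in the camera frame: one at the reference depth,
# one slightly behind it.
near = [0.1, 0.05, 2.0, 1.0]
far = [0.1, 0.05, 2.05, 1.0]

w_persp_near = scale(mat_vec(P_persp, near))
w_orth_near = tuple(mat_vec(P_orth, near))
w_persp_far = scale(mat_vec(P_persp, far))
w_orth_far = tuple(mat_vec(P_orth, far))
```

At the reference depth both models give the same pixel. For the point 5 cm deeper, the two projections differ by well under a pixel, which is the regime where the text says the orthographic approximation is appropriate.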

Camera models are specified by the sets of intrinsic and extrinsic
parameters. These parameters must be determined in a calibration stage before
the model can be employed for tracking. I used Robert’s calibration algorithm,
described in [53], for all of the experiments in this thesis. The algorithm
uses a single image of a cube of known size to determine both the intrinsic
and extrinsic camera parameters. The procedure has two stages: First,
the user manually identifies the position of six predetermined points in the
cube image, and an approximate calibration matrix is generated. Second,
the approximate model is refined in an iterative stage using additional,
automatically detected image features and a standard numerical minimization
package. The advantage of Robert’s algorithm is that the image features are
continuously updated in the iterative stage along with the camera model,
reducing the effect of any initial errors in locating the image points.

Evaluating  the  accuracy  of  a  calibrated  camera  model  is  a  difficult  task. 
In  theory,  image  features  from  two  faces  of  the  cube  image  provide  sufficient 
geometric constraints for calibration (see [16], Sec. 3.4.1.3). However, since
numerical  minimization  is  employed,  there  is  no  guarantee  that  the  stopping 
point  is  the  global  minimum.  A  partial  evaluation  of  the  calibration  accuracy 
was  obtained  when  a  pair  of  cameras  were  calibrated  for  stereo  experiments. 
In  this  case,  the  epipolar  lines  for  features  in  both  images  were  examined 
and  found  to  be  accurate  to  within  the  image  resolution.  Additional  exper¬ 
imental  evaluation  of  Robert’s  algorithm  is  described  in  [53].  An  advantage 
of  calibrating  with  a  cube  target,  as  opposed  to  the  series  of  grid  positions 
that  are  traditionally  employed,  is  that  multiple  cameras  with  convergent 
axes can be easily calibrated with respect to the same world frame (defined
within the cube). The calibration cube was manufactured out of PVC plastic
to a tolerance of ±0.003 in. on all dimensions, by K²T, Inc.

2.4  Tracking  Through  Template  Registration 

Visual tracking is a sequential image registration problem. The state estimate
in  each  frame  minimizes  the  residual  error  between  the  projected  object 
model  and  the  image.  Different  tracking  approaches  are  distinguished  by 
the  choice  of  residual  function.  In  template  registration,  the  residual  error 
measures  the  intensity  difference  between  an  input  image  and  the  image 
predicted  by  the  kinematic  model.  A  set  of  templates  describe  the  image 
appearance  of  each  link.  The  position  of  each  template  in  the  image  is 
given  by  the  kinematic  and  camera  models  as  a  function  of  the  state.  State 
estimates  are  obtained  by  minimizing  the  residual  numerically. 

This  section  has  four  parts.  First,  deformation  functions  are  developed 
that map templates to images as a function of the state. Next, deformations
are  incorporated  into  a  Sum  of  Squared  Differences  (SSD)  residual  function. 
A  gradient-based  minimization  algorithm  is  described  in  the  third  part.  In 
the  last  part,  the  deformation  function  Jacobian  is  derived  and  its  computa¬ 
tion  is  discussed. 


[Figure: panel labels Template and Image]

as illustrated in Fig. 2.6. The template plane determines the position in the
shape  frame  of  each  template  pixel. 

Given  the  approximate  pose  of  a  link  relative  to  the  camera,  the  appro¬ 
priate  template  plane  can  be  chosen  automatically.  The  number  of  required 
templates is a function of the shape and photometry of the link. For
cylindrical links, like finger phalanges, a single view may be enough, while an
object like the palm or body torso will require more. The number of required
views for cylindrical objects can be reduced significantly by allowing the template
plane  to  rotate  around  the  axis  of  symmetry,  maintaining  a  frontal  camera 
orientation. 

The  template  plane  model  merges  geometric  and  photometric  aspects  of 
image  appearance  in  a  single  framework.  The  orientation  and  position  of  the 
template  plane  relative  to  the  camera  capture  the  effects  of  foreshortening 
and  rotation  on  the  image  of  the  link.  The  template  pixels  capture  intensity 
variations at a finer scale resulting from the shape of the finger phalanges.
A  variety  of  features,  from  edges  to  textures,  can  be  employed  by  changing 
the  form  of  the  template. 

Given the state of the hand, the image appearance of each link can be
synthesized by projecting its template plane through the camera model. The
combination of kinematic and camera transforms is represented by a
deformation function [51], $f(q, s)$, which maps template coordinates to image
coordinates as a function of the state. If $s = [u\ v]$ denotes a template pixel
and $w = [x\ y]$ denotes its corresponding image pixel, then $w = f(q, s)$. This
mapping is illustrated in Fig. 2.6 for a finger tip template. The deformation
function is constructed from a series of coordinate transformations. Let the
coordinate axes of template $I_j$, expressed in its shape frame, make up the
column vectors of the 3x2 matrix $\Gamma_j$. Combining this with Eqns. 2.2, 2.4,
and 2.7 yields the orthographic deformation function

$$ f_j(q, s) = P\, T_w^j(q)\, T_j^s(q)\, \Gamma_j\, s . \tag{2.9} $$

Deformable template models have appeared in previous tracking and
registration work. Their use for pattern recognition goes back at least to
Widrow [70]. In 1981, Lucas and Kanade [36] proposed an image registration
scheme using affine deformations that has become a standard solution to
optical flow and point tracking problems. In joint work with Andy Witkin,
I  investigated  an  approach  to  2D  template  tracking  based  on  deformation 
models  [51].  In  our  approach,  the  arbitrary  (rigid  or  nonrigid)  motion  of  the 
pixels  was  assumed  to  be  the  result  of  an  unknown,  but  smooth,  deformation 
function.  This  unknown  deformation  was  approximated  by  its  truncated 
Taylor  Series,  resulting  in  a  family  of  polynomial  deformation  models.  I 
developed  a  real-time  system  on  an  SGI  GTX  workstation  that  used  these 
models  to  track  a  small  window  of  pixels,  selected  by  the  user,  through  an 
image sequence. A related hierarchy of 2D motion models was published
later by Bergen et al. [6]. The kinematic deformation model of Eqn. 2.9 is
a  natural  extension  of  this  earlier  work  to  a  3D  tracking  domain.  A  further 
extension  of  this  paradigm  occurs  in  Chpt.  3,  in  addressing  self-occluding 
motion. 

2.4.2  SSD  Residual  Error  Function 

The residual function for template registration measures the intensity
difference between a deformed template and an input image. I employ the
standard Sum of Squared Differences (SSD) error measure between filtered
pixels. In the SSD approach, both the input image and the templates are
convolved with a filter and subtracted, squared and summed to obtain the
residual error. By changing the filter, different properties of the image can
be emphasized. For example, using a Laplacian of Gaussian (LOG) filter
produces a residual error which is sensitive to edge energy. Using Eqn. 2.9,
the residual at a pixel $s$ in template $I_j$ can be written

$$ R_j(q, s) = I(f_j(q, s)) - I_j(s) , \tag{2.10} $$

where $I$ and $I_j$ are the filtered input image and template, respectively. The
template error resulting from this residual choice is given by

$$ E(q) = \frac{1}{2} \int_{I_j} R_j(q, s)^2 \, ds = \frac{1}{2} \int_{I_j} [I(f_j(q, s)) - I_j(s)]^2 \, ds . \tag{2.11} $$

Each template in the object model contributes an error term of the form of
Eqn. 2.11.

The  SSD  residual  is  one  possible  choice  from  a  large  class  of  image  sim¬ 
ilarity  measures  [55,  25].  It  is  a  traditional  choice  for  template  matching 
applications,  because  it  works  well  in  practice.  Any  differentiable  residual 
could  be  employed  in  Eqn.  2.11  to  measure  the  error,  and  the  rest  of  the 
framework  would  remain  unchanged. 
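A discrete version of the SSD error can be sketched in a few lines. This is a toy illustration under stated assumptions, not the tracker's residual: the deformation is a pure integer translation (`deform` is a hypothetical stand-in for the kinematic deformation function), the filtering step is omitted, the integral is replaced by a sum over template pixels, and the synthetic image is chosen only so that the minimum is unique.

```python
def make_image(width, height):
    """Synthetic intensity image; the pattern is arbitrary but non-repeating."""
    return [[x * x + 3 * y * y + x * y for x in range(width)]
            for y in range(height)]

def crop(image, x0, y0, size):
    """Cut a square template out of the image at column x0, row y0."""
    return [row[x0:x0 + size] for row in image[y0:y0 + size]]

def deform(q, s):
    """Hypothetical deformation: translate template pixel s by the state q."""
    return (s[0] + q[0], s[1] + q[1])

def ssd_error(image, template, q):
    """Half the sum of squared residuals I(f(q, s)) - I_j(s) over the template."""
    total = 0.0
    for v, row in enumerate(template):
        for u, value in enumerate(row):
            x, y = deform(q, (u, v))
            residual = image[y][x] - value
            total += 0.5 * residual * residual
    return total

image = make_image(8, 8)
template = crop(image, 2, 3, 3)  # true state is the offset (2, 3)

# Exhaustive search over every offset that keeps the template inside the image.
candidates = [(qx, qy) for qx in range(6) for qy in range(6)]
best = min(candidates, key=lambda q: ssd_error(image, template, q))
```

The error is zero only at the offset the template was cut from. Exhaustive search is used here purely to expose the shape of the error surface.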

2.4.3  State  Estimation  by  SSD  Residual  Minimization 

The  residual  in  Eqn.  2.10  is  a  nonlinear  function  of  the  state  q.  There 
are  two  main  sources  of  nonlinearity:  trigonometric  terms  in  the  kinematic 
model  from  Eqn.  2.9,  and  intensity  variations  in  the  template  and  input 
images.  Use  of  a  perspective  camera  model  introduces  a  secondary  source  of 
nonlinearity.  The  kinematic  model  is  a  smooth  function  of  the  state.  SSD 
error  functions  are  also  observed  empirically  to  be  smooth  and  approximately 
quadratic  around  their  minima  [4].  As  a  result,  Eqn.  2.11  can  be  treated  as 
a  smooth  function  of  the  state  and  minimized  numerically  through  standard 
gradient-based  methods  [14].  The  use  of  continuous  variable  optimization 
techniques  is  one  of  the  key  distinctions  between  the  tracking  approaches  in 
this  thesis  and  [72],  and  the  earlier  works  of  O’Rourke  [42]  and  Hogg  [23]. 
These  optimization  techniques  make  it  possible  to  search  much  larger  state 
spaces  than  classical  interval  analysis  or  constraint  satisfaction  approaches. 

Given an error function like Eqn. 2.11, tracking can proceed by a simple
gradient descent minimization algorithm. If $E_k(\cdot)$ denotes the
state-dependent error for input image $I_k$, the state update is given by:

$$ q_k = q_{k-1} - \rho \frac{\partial E_k}{\partial q}(q_{k-1}) , \tag{2.12} $$

where $\rho$ is the step size. The update step can be iterated when the inter-frame
motion is large. The estimate from the previous frame, possibly modified by
velocity-based prediction, serves as the starting point for minimization in the
current frame. Sec. 2.5.3 discusses the use of more sophisticated minimization
algorithms than gradient descent.
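The iterated update can be illustrated on a one-dimensional toy problem. Everything specific in this sketch is an assumption made for illustration: the "image" is the analytic intensity profile x squared, the state is a single translation q, the template holds three samples taken at the true offset, and the step size and iteration count were picked by hand so that the iteration contracts.

```python
SAMPLES = [1.0, 2.0, 3.0]  # template coordinates s
TRUE_OFFSET = 0.5          # state that generated the template

def image(x):
    """Toy continuous intensity profile."""
    return x * x

def image_gradient(x):
    """Derivative of the intensity profile with respect to position."""
    return 2.0 * x

TEMPLATE = [image(s + TRUE_OFFSET) for s in SAMPLES]

def error(q):
    """Half the sum of squared residuals between shifted image and template."""
    return 0.5 * sum((image(s + q) - t) ** 2
                     for s, t in zip(SAMPLES, TEMPLATE))

def error_gradient(q):
    """Residual times residual Jacobian, summed over the template samples."""
    return sum((image(s + q) - t) * image_gradient(s + q)
               for s, t in zip(SAMPLES, TEMPLATE))

def track(q_start, step, iterations):
    """Iterate the gradient descent state update from a starting estimate."""
    q = q_start
    for _ in range(iterations):
        q = q - step * error_gradient(q)
    return q

q_est = track(0.0, 0.01, 20)
```

Starting from q = 0, the estimate moves monotonically toward the true offset of 0.5. In a tracker the starting point would be the previous frame's estimate, as the text describes.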

Differentiating Eqn. 2.11 yields

$$ \frac{\partial E}{\partial q} = \int_{I_j} R_j(q, s)\, \frac{\partial R_j}{\partial q}(q, s)\, ds , \qquad \frac{\partial R_j}{\partial q} = \left( \frac{\partial f_j}{\partial q} \right)^{T} \frac{\partial I}{\partial w} , \tag{2.13} $$

where $\partial R_j / \partial q$ denotes the residual Jacobian. The residual
Jacobian is a product of two terms, the derivative of the deformation function,
and the image gradient. Since the deformation function is a product of kinematic
transforms (see Eqn. 2.9), its derivative must take the form of a kinematic
Jacobian. The derivation of this Jacobian and its on-line computation are
discussed in the next section. The Jacobian maps state velocities to the image
plane velocities of template pixels. It follows that the residual Jacobian at
an image point is a weighted combination of the kinematic Jacobian of its
associated link template point.

The  key  to  the  practical  success  of  the  gradient-based  minimization  ap¬ 
proach  is  a  high  image  sampling  rate,  which  limits  image  motion  between 
frames.  Templates  will  generate  useful  error  signals  only  when  they  “see” 
a significant portion of the link they are tracking, making it important to
limit  motion  in  the  image  plane.  In  the  state  space,  a  region  of  convergence 
(ROC)  exists  around  the  global  minimum.  Interframe  motion  must  be  small 
enough  for  the  predicted  state,  which  determines  the  starting  point  for  mini¬ 
mization,  to  fall  within  the  ROC  at  each  image  [64].  Analyzing  the  required 
sampling  rate  is  difficult,  as  it  depends  on  the  object  state,  the  form  of  the 
residual error measure, and the image properties. However, experimental
results in Chpt. 4 indicate that image motions of five to ten pixels can be
handled  successfully,  corresponding  to  a  15  Hz  sampling  rate  under  normal 
hand  motion. 


2.4.4  Deformation  Function  Jacobians 

The deformation function of Eqn. 2.9 is a series of coordinate
transformations. As a result, standard techniques from robotics (see [59],
Sec. 5.1) can be employed to compute its Jacobian. Let $s_j$ be a pixel in
template $I_j$ which projects to $w_j$ in the image plane. Let
$p_j = \Gamma_j s_j$ denote the point’s coordinates in the shape frame of
link j. Suppose further that link frame i has a revolute joint with angle
$\theta_i$ that affects the position of frame j. Then the basic Jacobian
component, $\partial w_j / \partial \theta_i$, can be derived as follows.

The first step is to reorganize Eqn. 2.9, letting $W_j$ denote the point $p_j$
in world coordinates prior to camera projection, obtaining

$$ f_j(q, s) = P W_j = P\, T_w^s(q)\, p_j = P [R_w^s(q)\, p_j + d_w^s(q)] , \tag{2.14} $$

where $R_w^s$ and $d_w^s$ are the rotation and translation components of
$T_w^s$, the position of link j’s shape frame in world coordinates.

Separating the transform for $W_j$ into components before and after frame
i and differentiating with respect to time yields

$$ \dot{W}_j = \dot{R}_w^i (R_i^s p_j + d_i^s) + \dot{d}_w^i = \dot{\theta}_i\, \omega_i \times R_w^i (R_i^s p_j + d_i^s) , \tag{2.15} $$

where $\omega_i$ is the rotation axis for joint i expressed in world
coordinates. The Jacobian follows immediately as

The term in braces is the moment arm for the rotation of point $s_j$ about joint
i, expressed in world coordinates. It is determined by $W_j(\cdot)$, a function
which gives the 3D position of a point in template coordinates with respect to
the world frame. From the form of Eqn. 2.16, the Jacobian component for a
revolute joint is obtained by projecting a spatial velocity vector into the image
plane. In cases where perspective effects are significant, the orthographic
mapping is replaced by an affine approximation to the perspective projection
at each link.

Using  Eqn.  2.16  in  a  tracking  algorithm  involves  the  following  steps:  First, 
the  spatial  positions  of  all  frames  are  computed  with  respect  to  the  world. 
Then the revolute joints are examined in sequence. For each joint, the template
planes which it affects are processed in order. Each template pixel
involved in Eqn. 2.13 makes a contribution to the Jacobian which is determined
solely by its position with respect to the active joint axis. The total
cost of the Jacobian computation depends on the number of templates, their
size in pixels, the DOFs of the object, and its kinematic topology. Empirical
evaluation of this cost and its ramifications for real-time implementation are
presented  in  Chpt.  4.  The  compact  derivation  of  Eqn.  2.16  and  the  simplicity 
of  its  computation  are  fortunate  consequences  of  the  highly  regular  structure 
of  spatial  kinematic  models. 
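
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in this work: the camera is taken to be orthographic and to keep the x and y world coordinates, and the names `revolute_jacobian_column`, `jacobian_for_points`, `axis`, `joint_origin`, and `affects` are hypothetical. Each column is the image-plane projection of the spatial velocity given by the joint axis crossed with the moment arm.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def revolute_jacobian_column(axis, joint_origin, point):
    """Image velocity of `point` per unit joint rate (assumed orthographic camera).

    axis:         unit rotation axis of the joint, world coordinates
    joint_origin: a point on the joint axis, world coordinates
    point:        template point already mapped into world coordinates
    """
    # Moment arm: vector from the joint axis to the point.
    arm = tuple(p - o for p, o in zip(point, joint_origin))
    # Spatial velocity of the point for a unit joint velocity.
    velocity = cross(axis, arm)
    # Assumed orthographic projection: drop the depth (z) component.
    return (velocity[0], velocity[1])


def jacobian_for_points(joints, points, affects):
    """Two Jacobian rows per point, one column per revolute joint.

    joints:  list of (axis, joint_origin) pairs
    points:  list of world-space points
    affects: affects[i][j] is True when joint i moves point j
    """
    rows = []
    for j, point in enumerate(points):
        du, dv = [], []
        for i, (axis, origin) in enumerate(joints):
            if affects[i][j]:
                cu, cv = revolute_jacobian_column(axis, origin, point)
            else:
                cu, cv = 0.0, 0.0  # this joint does not move the point
            du.append(cu)
            dv.append(cv)
        rows.extend([du, dv])
    return rows
```

In this sketch a rotation about the camera's depth axis moves a point sideways in the image, while a rotation that carries the point directly towards the camera produces a zero column.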

2.5  Tracking  Through  Feature  Alignment 

In  the  template  registration  approach  to  visual  tracking,  intensity  errors  are 
used  to  measure  the  geometric  misalignment  between  the  projected  model 
and  the  input  image.  Templates  provide  a  useful  level  of  generality,  and 
make it possible to exploit arbitrary texture cues. For a specific object like
the  hand,  however,  the  constraints  provided  by  template  matching  can  be 
approximated  by  purely  geometric  error  functions  involving  point  and  line 
features. The advantage of this is two-fold. First, geometric residual errors
[...]

Figure 2.7: F [...]

[...]tions can be obtained from the deformation function of Sec. 2.4. In this
case, only a single line in the template plane, corresponding to the central
axis of the cylindrical link, is mapped through the deformation function. If
$s_j$ represents a point along the central axis, its contribution to the residual
error is given by

$$l_j(q, s_j) = m^T w_j - \rho = m^T f_j(q, s_j) - \rho, \qquad (2.17)$$

where $m = [a \;\; b]^T$.

The  Jacobian  component  generated  by  this  residual  is 

$$\frac{\partial l_j}{\partial q} = m^T \frac{\partial f_j}{\partial q}. \qquad (2.18)$$

The role of the line feature in approximating the template residual can be
seen by comparing Eqns. 2.18 and 2.13. In the line case, the normal vector
m plays the same role as the image gradient. It corresponds to an image
gradient field with a zero component along the central axis of the link.
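
As a small illustration of the line residual and its Jacobian row, the following Python sketch assumes that m is a unit normal and that the projected point and its derivative with respect to the state are already available. The function names and the two-list layout of the derivative are hypothetical choices made for this example.

```python
def line_residual(m, rho, f):
    """Signed distance from the projected point f = (u, v) to the line m . x = rho."""
    return m[0] * f[0] + m[1] * f[1] - rho


def line_jacobian_row(m, df_dq):
    """Row of the residual Jacobian, m^T (df/dq).

    df_dq is a 2 x n matrix stored as two lists: (du/dq, dv/dq).
    """
    du, dv = df_dq
    return [m[0] * a + m[1] * b for a, b in zip(du, dv)]
```

With m = (0, 1), a state variable that moves the point along the line (pure u motion) gets a zero entry in the row, which is the sense in which the normal acts like a gradient with no component along the link axis.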


2.5.2  Point  Feature  Residual  and  Jacobian 


Links at the end of kinematic chains, like the fingertips of the hand, generate
point features with parameters $[x \;\; y]$, as illustrated in Fig. 2.7. The tip
residual measures the Euclidean distance in the image between the projected
model point and the actual tip location, $c_j$, in the image:


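$$t_j(q, s_j) = \| v_j(q, s_j) \| = \| f_j(q, s_j) - c_j \|. \qquad (2.19)$$

Its Jacobian component is given by

$$\frac{\partial t_j}{\partial q} = \frac{v_j^T}{\| v_j \|} \frac{\partial f_j}{\partial q}. \qquad (2.20)$$

In this case, the unit vector in the $v_j$ direction models an image gradient
that is nonzero only along radial lines from the tip feature position.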
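
The tip residual and its Jacobian row can be sketched in the same way. This is an illustrative Python fragment with hypothetical function names; returning a zero row when the model point coincides with the measured tip is an assumption of the example, since the unit vector is undefined there.

```python
import math


def tip_residual(f, c):
    """Euclidean image distance between projected model point f and measured tip c."""
    return math.hypot(f[0] - c[0], f[1] - c[1])


def tip_jacobian_row(f, c, df_dq):
    """Row of the residual Jacobian: unit vector along v = f - c, times df/dq.

    df_dq is a 2 x n matrix stored as two lists: (du/dq, dv/dq).
    """
    v = (f[0] - c[0], f[1] - c[1])
    norm = math.hypot(v[0], v[1])
    du, dv = df_dq
    if norm == 0.0:
        # Direction undefined at a perfect match; a zero row is an assumption here.
        return [0.0 for _ in du]
    unit = (v[0] / norm, v[1] / norm)
    return [unit[0] * a + unit[1] * b for a, b in zip(du, dv)]
```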




The  residual  functions  in  Eqns.  2.17  and  2.19  measure  distances  in  the 
image  plane.  The  feature  residuals  for  each  link  and  tip  in  the  model  are 
concatenated  into  a  single  residual  vector,  R(q).  The  total  error  is  then 
given  by 

$$E(q) = \tfrac{1}{2} R(q)^T R(q). \qquad (2.21)$$

This  error  will  be  quadratic  in  the  distances  from  the  hand  model  projections 
to  the  image  features.  This  agrees  with  the  empirical  observation  that  SSD 
residual  errors  are  quadratic  around  their  minimum. 

Although these approximations were motivated by the hand, they are
applicable to the body as well. The primary difference between fingers
and limbs is that clothing can provide image gradient constraints in arbitrary
directions, unrelated to the central axis of the limb. However, clothing and
background color will still often differ significantly, resulting in a strong edge
constraint. If the interior texture is insignificant given the resolution of the
camera, then the line and point models can be applied without modification.

2.5.3 State Estimation by Feature Residual Minimization

State estimation is performed by minimizing the total error
in Eqn. 2.21. This is a classical nonlinear least-squares problem, which can
be solved numerically by Gauss-Newton minimization [14]. The GN state
update equation is given by

$$q_{k+1} = q_k - \left[ J_k^T J_k + S \right]^{-1} J_k^T R_k, \qquad (2.22)$$

where $J_k$ is the Jacobian matrix for the residual $R_k$, both of which are
evaluated at $q_k$. S is a constant diagonal conditioning matrix used to stabilize
the least squares solution in the presence of kinematic singularities. Each
entry in S weights one of the state variables, determining how strongly it is
affected by the residual. Weights for quaternion, translation, and joint angle
states were set at 1.0, 10.0, and 1000.0, respectively. These weights were
chosen empirically and used in all of the experiments in this thesis. The
above GN formulation is based on the rigid body tracking work of Lowe [35].
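
A stabilized Gauss-Newton step of this kind can be sketched in plain Python. The sketch assumes the update has the form q - (J^T J + S)^-1 J^T R with a diagonal S, solves the small linear system by Gaussian elimination, and uses hypothetical names throughout. The values in the usage note are illustrative and are not the weights quoted above.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gauss_newton_step(q, jac, res, s_diag):
    """One stabilized update: q - (J^T J + S)^-1 J^T R.

    q:      current state (list of n floats)
    jac:    Jacobian as a list of rows, each of length n
    res:    residual vector, one entry per Jacobian row
    s_diag: diagonal of the conditioning matrix S
    """
    n = len(q)
    jtj = [[sum(row[i] * row[j] for row in jac) + (s_diag[i] if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    jtr = [sum(row[i] * r for row, r in zip(jac, res)) for i in range(n)]
    delta = solve_linear(jtj, jtr)
    return [qi - di for qi, di in zip(q, delta)]
```

For example, with the rank-deficient Jacobian `[[1.0, 0.0], [0.0, 0.0]]`, residual `[2.0, 0.0]`, and `s_diag = [1.0, 1.0]`, the system stays solvable and only the first state variable is updated. With a zero conditioning matrix the same input would lead to a division by zero in the solver.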

2.5.4  Visual  Tracking  and  Kinematic  Singularities 

The use of a conditioning matrix in the state estimator of Eqn. 2.22 closely
parallels methods for dealing with kinematic singularities in robot control
(see [40], Eqn. 9.22). In both the articulated tracking and control cases, the
goal is to obtain a stabilized inverse solution in the case where the Jacobian
has lost rank. One important difference between the estimation and control
cases is the influence of the feature model on the singular configurations. The
matrix $J_k$ in the robot manipulator case maps from joint velocities to link
[...] Jacobians will be nonzero for all finger articulations, but they will all lie in
the line formed by intersecting the finger and image planes. This line will
be parallel to the feature measurement lines produced by the finger, leading
to loss of rank. The equations reflect the intuitively obvious fact that when
the finger is curling towards the camera, the motion of its edges contains no
information about the 3D motion.

Examination of the point feature Jacobian of Eqn. 2.20 indicates that
it possesses the same two singular configurations that the line does. However,
the orthogonal case is much less serious for a point feature, as it does
not generate a singular subspace. This analysis demonstrates the value of
the  closed  form  approximations  to  the  template  residuals.  They  lead  to  an 
intuitive  mathematical  description  of  a  basic  property  of  articulated  object 
tracking  problems. 

As in the robot manipulator case [40], state space neighborhoods of the
singular points will exhibit marked sensitivity loss, in that large state space
motions will have little effect on the image. This sensitivity loss makes accurate
tracking in the neighborhood of singularities difficult. Experimental
observations of the effects of near-singular tracking are discussed in Chpt. 4.
The stabilization method of Eqn. 2.22, which has been used for rigid body
tracking [35], also works for articulated state estimation problems.

2.5.5  Tracking  with  Multiple  Cameras 

Both the template registration and feature alignment approaches generalize
easily to tracking with more than one camera. When multiple cameras are
used, the residual vectors from each camera are concatenated to form a single
global residual vector. This formulation exploits partial observations. If a
finger link is visible in one view but not in another due to occlusion, the
single view measurement is still incorporated into the residual, and therefore
the estimate. When this framework is augmented with occlusion handling,
the resulting algorithm can utilize any visible pixel from any camera position
in estimating the state. Experimental articulated tracking results using two
cameras were first reported in [46, 48]. Two-camera results for human body
tracking were presented more recently in [29, 32].

2.6  Discussion 

Kinematic models made up of links and joints represent the most basic constraints
on the motion of articulated kinematic chains, and make it possible to
recover 3D motion from a single image sequence. A kinematic hand model
was developed through anatomical analysis and calibrated using an interactive
tracking system. Sections 2.4 and 2.5 described two approaches to
estimating the model state from an image sequence.

The template registration approach of Sec. 2.4 belongs to the class of
direct, energy-based vision algorithms which was popularized by deformable
models [65] (including 2D Snakes [28]), and has been applied to a wide variety
of problems [73, 53]. It is a direct method in which pixels are mapped to
state estimates without an intervening feature detection stage. Its advantage
is the direct enforcement of kinematic constraints on image interpretation.
These constraints integrate information from different parts of the image,
reducing the impact of localized interpretation errors on the final estimate.
This will turn out to be particularly useful in tracking self-occluding objects
in Chpt. 3.

Constraints in the classic energy-based approach take the form of a smoothness
penalty term which is added to the residual error in forming the objective
function. These soft constraints can be viewed as prior distributions over the
state space [62, 63]. They are enforced explicitly, reflecting the fact that the
size of the over-parameterized state space exceeds the actual DOFs in the
scene. In contrast, kinematic constraints are enforced implicitly through the
joint angle parameterization of articulated motion. Kinematic models are
hard constraints: no amount of measurement error should cause the rigid
bodies in a chain to separate from each other, or rotate in ways not permitted
by their joints.

Section 2.5 demonstrates that the template residual functions for finger
phalanges can be approximated by geometric expressions in line and point
features. The result is a second tracking approach based on feature alignment.
The residuals for point and line features have a closed form expression
which makes the singularity analysis of Sec. 2.5.4 possible. In addition, these
features can be detected through a simple algorithm which is suitable for
real-time implementation, as Sec. 4.3.1 will demonstrate.


Chapter  3 

Tracking  Self-Occluding 
Objects  with  Layered 
Templates 


Self-occlusion is a ubiquitous property of articulated object motion. It
occurs when an occluding link blocks the camera's view of an occluded link.
The images in Fig. 3.1 illustrate the effect of occlusion on the visual tracking
problem. In this example, the hand rotates around the middle finger axis
with the fingers held rigid. Suppose that the first and second fingers are each
being tracked with a single template using the approach of Sec. 2.4. In figure
(b), the tw [...]

Figure 3.1: Three snapshots from a motion sequence, illustrating the different
occlusion relations between the first and second fingers of the hand. Panels:
(a) First Occludes Second, (b) First and Second Disjoint, (c) Second Occludes First.

main components. The first component is a visibility order for overlapping
templates, with the property that no template is occluded by a template
that follows it in the list. The visibility order can be used to determine
which template corresponds to a given region of pixels. The order between
templates changes with the state, as in the transition from figure (a) to (c).
In (a), the visibility order is {Template 1, Template 2}, while in (c) it is
the reverse. The second component in the layered model is a set of window
functions that block, or mask out, the contributions of occluded templates,
as determined by the visibility order. Each template has an attached window
function which moves with it as a function of the state.
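
The interplay between a visibility order and window functions can be illustrated with a short Python sketch. It is a simplified stand-in rather than the formulation developed in this chapter: templates are represented by sets of pixel coordinates, and the names `assign_pixels` and `window` are hypothetical.

```python
def assign_pixels(visibility_order, masks):
    """Map each covered pixel to the front-most template that covers it.

    visibility_order: template names, front-most first
    masks:            dict name -> set of (row, col) pixels covered by the
                      projected template at the current state
    """
    owner = {}
    for name in visibility_order:
        for pixel in masks[name]:
            # setdefault keeps the first (front-most) claim on each pixel.
            owner.setdefault(pixel, name)
    return owner


def window(owner, name, pixel):
    """Window function: 1 where template `name` is visible at `pixel`, else 0."""
    return 1 if owner.get(pixel) == name else 0
```

With two overlapping templates, reversing the visibility order changes which template owns the shared pixels, mirroring the transition from panel (a) to panel (c) of Fig. 3.1.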

Tracking using a layered representation requires the simultaneous solution
of two problems: determining the visibility order for the templates that
describe an object, and registering the overlapping templates to the input
image. In bottom-up approaches to occlusion analysis, visibility order is estimated
from image motion [11, 67] or contours [41]. This thesis explores
an alternative, top-down approach which uses the kinematic model in conjunction
with a high image sampling rate to partition the state space into
regions with a fixed visibility order. In this approach, the visibility order for
the current frame is predicted from the previous state estimate and used to
constrain image interpretation.

The following sections develop the layered template representation in
more detail, and describe its use in a model-based tracking algorithm for
self-occluding objects. The first step is an analysis of the visibility orders
for objects, like the hand, that are composed of planar kinematic chains.
The next step is the incorporation of visibility-ordered templates into the
registration algorithm of Sec. 2.4. Window functions provide a mathematical
tool for arbitrating access to the image among overlapping templates.
They are incorporated into a residual error function, which is minimized
by gradient-based methods. The main computational step in gradient-based
minimization is the Jacobian computation for layered templates, which is described
in detail. This is followed by a discussion of image segmentation, and
an outline of the complete tracking algorithm. The final contribution of this
chapter is an analysis of the existence conditions for the invariant visibility
orders employed in tracking. Occlusion ambiguities, in which the visibility
order is not invariant, are introduced and their ramifications for tracking are
discussed.

3.1  Model-based  Occlusion  Analysis 

Figure 3.2: The partition of the rotation space (unit circle) into regions with
an invariant visibility order. This is a top view of the scene in Fig. 3.1, with
the camera located on the right. $\phi$ gives the rotation of the hand relative to
the camera.

The tracking algorithm developed in this chapter is based on a simple observation:
the occlusion relationships between the convex rigid bodies of an
articulated object in motion rarely change instantaneously. As a result, the
visibility order for the object templates is invariant under the small motions
that occur between two frames of an image sequence, given a high sampling
rate. This invariant order makes it possible to remove the discrete, combinatoric
aspect of occlusion from the tracking problem, leaving only the


registration of overlapping templates. In this section, the use of an invariant
visibility order in tracking is illustrated for the two-finger motion sequence
of Fig. 3.1.

Figure 3.2 shows the visibility order for the fingers in Fig. 3.1 as a function
of the hand state. Since the hand has one DOF in this example, the space
of rotations is a unit circle. The angles marked A, C, D denote occlusion
events, points at which the occlusion relations change. Passing through $\phi = A$,
for example, causes a transition from (a) to (b). The amount of hand
rotation between frames is limited by the sampling rate to a small angle,
$\Delta\phi$. Therefore, in local tracking the state estimate for the current frame
is restricted to a motion interval of $\pm\Delta\phi$ around the previous estimate. If
$\phi_{k-1}$ is the state from the previous frame, then it follows that
$\phi_k \in [\phi_{k-1} - \Delta\phi, \; \phi_{k-1} + \Delta\phi]$.

Since the occlusion events are sparsely distributed, the visibility order
for the two templates will be constant from frame to frame across most of
the image sequence. The template order in cases (a) and (c), for example,
holds for nearly 90 degrees of hand rotation, which is much larger than $\Delta\phi$.
When the motion interval contains an occlusion event, the visibility order
will change. However, the transition always occurs between an occluded
and a disjoint case. As a result, the onset of occlusion can be anticipated
by assigning the occluded visibility order to the disjoint case near the event.
This assignment is achieved by growing the occluded regions into the disjoint
regions by the motion bound, $\Delta\phi$, resulting in the state space partition shown
in Fig. 3.2 as dark and light grey bands.

The partition illustrated in Fig. 3.2 divides the state space into regions
with a locally invariant visibility order. This partition has the following
property: Given the state of the object at time k, its membership in the
state partition determines the visibility order at time k + 1. The occluded
partitions (the light and dark grey sets in Fig. 3.2) contain all of the states
that lie within $\pm\Delta\phi$ of an occluded configuration. The disjoint partitions
(the white sets in Fig. 3.2) contain the states for which there are guaranteed
to be no occlusions under bounded motion. These sets form a buffer zone
in which the tracker can be configured for the next occlusion event. The
partition is used in visual tracking problems to predict the visibility order for
the current, unknown state from the previous state estimate. The predicted
visibility order is used in turn to construct a layered template representation
of the image, thereby reducing the tracking problem to the registration of
overlapping templates between frames.
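
For the one-DOF rotation example, the use of the partition can be sketched in Python as below. The arc endpoints and the motion bound in the usage note are made-up values, and the function names are hypothetical. The sketch also assumes that each grown arc spans less than a full turn.

```python
def in_arc(angle, start, end):
    """True if `angle` lies on the arc from `start` to `end` (degrees, counter-clockwise)."""
    span = (end - start) % 360.0
    return (angle - start) % 360.0 <= span


def predict_visibility(phi_prev, occluded_arcs, delta_phi):
    """Visibility order to use for the next frame, or None if no occlusion is possible.

    phi_prev:      state estimate from the previous frame (degrees)
    occluded_arcs: list of (start, end, order) tuples; `order` is the
                   visibility order that holds on that arc
    delta_phi:     bound on the rotation between frames (degrees)
    """
    for start, end, order in occluded_arcs:
        # Grow each occluded arc by the motion bound on both sides.
        if in_arc(phi_prev, start - delta_phi, end + delta_phi):
            return order
    return None
```

As a usage example, take `arcs = [(-45.0, 45.0, ("T1", "T2")), (135.0, 225.0, ("T2", "T1"))]` and a bound of 5 degrees. A previous state of 48 degrees lies outside the first occluded arc but inside its grown version, so the occluded order is already assigned. A previous state of 60 degrees falls in the disjoint buffer zone.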

The construction of the partition in Fig. 3.2 depends on three properties
of the motion and the estimator. First, the regions in state space in which
the templates occlude each other must be separated by regions in which
they are disjoint. Second, the change in state between frames must be small
enough to take advantage of the disjoint regions. If $\Delta\phi$ is too large, growing
the occluded regions will eliminate the disjoint regions entirely. Third, the
state estimate must be accurate enough to make useful predictions about
membership in the partition. These issues are addressed in more detail in
the sections that follow.


3.2  Visibility  Orders  for  Planar  Kinematic 
Chains 

A key step in the tracking algorithm for self-occluding motion is the construction
of visibility orders for link templates. A visibility order for the
bodies  in  an  articulated  object  is  an  ordered  list  with  the  property  that  each 
body  will  not  be  occluded  by  any  of  the  bodies  that  follow  it.  The  next  three 
sections  present  a  set  of  rules  for  constructing  invariant  visibility  orders  for 
objects,  like  the  hand,  that  are  composed  of  planar  kinematic  chains.  Section 
3.4  discusses  the  existence  of  these  invariant  orders  in  the  general  case.  The 
simplest  type  of  visibility  order  is  a  binary  occlusion  relation  between  two 
bodies. 

3.2.1  Binary  Occlusion  Relations 

When the image plane projections of two objects overlap, and the visibility
of one of them (object A) is completely unaffected by the other (object B),
it is called a binary occlusion and A occludes B. If two solid objects have
convex shapes, then any occlusion between them will be binary.^

Consider a pair of convex objects undergoing bounded motion, such as
would occur between two frames in an image sequence. If the image plane
projections of these bodies do not overlap under the allowed motion, no
occlusion is possible. In this case, the bodies A and B are disjoint, which
is written A = B. The binary occlusion relation, will-occlude(A, B), is true
if A and B are not disjoint and A occludes B whenever their image plane
projections overlap. This is written $A \succ B$. For example, if two occluding
objects are located at distances $z_A$ and $z_B$ with respect to the camera, such
that $z_A < z_B$ over some range of motion, then $A \succ B$.

^Any two convex bodies can be separated by a plane which divides the viewing sphere
in half, and for all viewpoints in each half, the object it contains is completely visible.

The will-occlude relation describes a property of all possible occlusions
of the two bodies under limited motion. For most articulated objects (see
Sec. 3.4), one of A = B, $A \succ B$, and $B \succ A$ will be true for each pair of
bodies in all configurations. Because these relations are fixed over a motion
interval, they define a local occlusion invariant. The invariant is local because
it only holds under bounded motions of the object around an operating point
in state space. Since the occlusion relations are defined for objects with
arbitrary degrees of freedom, they generalize the concept of depth sorting in
constructing a layered representation. Section 3.4 discusses the construction
of visibility orders for general articulated objects from the set of pairwise
occlusion relations for their bodies.
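
One generic way to turn a set of pairwise will-occlude relations into a visibility order is a topological sort. The Python sketch below uses that standard technique purely as an illustration; it is not claimed to be the construction of Sec. 3.4, and the function name and the pair encoding are hypothetical.

```python
from collections import deque


def visibility_order(bodies, occludes):
    """Order bodies so that none is occluded by a body that follows it.

    bodies:   list of body names
    occludes: set of (a, b) pairs meaning a will-occlude b
    Returns None when the relations contain a cycle, in which case no
    consistent order exists.
    """
    blockers = {b: 0 for b in bodies}   # number of bodies in front of b
    behind = {b: [] for b in bodies}    # bodies directly behind each body
    for a, b in occludes:
        behind[a].append(b)
        blockers[b] += 1
    ready = deque(b for b in bodies if blockers[b] == 0)
    order = []
    while ready:
        a = ready.popleft()
        order.append(a)
        for b in behind[a]:
            blockers[b] -= 1
            if blockers[b] == 0:
                ready.append(b)
    return order if len(order) == len(bodies) else None
```

Disjoint pairs simply contribute no entry to `occludes`, so they impose no constraint on the order.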

3.2.2  Occlusion  Relations  for  Revolute  Joints 

The on-line determination of visibility orders can be greatly simplified for an
object like the hand through kinematic analysis. The first step is to define
the will-occlude relation for two links connected by a revolute joint, a basic
component of articulated kinematic chains. Using this definition and the
[...]

Figure 3.3: Occlusion properties of two links connected by a revolute joint.


between  any  two  links  in  a  kinematic  c 




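angle $\theta_v$ between $E_v$ and A:

$$\begin{array}{lll}
\theta_v > 0: & \theta_j \in [0, \theta_v] & \text{B occludes A} \\
 & \theta_j \in [\theta_v - \pi, 0] & \text{A occludes B} \\
\theta_v < 0: & \theta_j \in [\theta_v, 0] & \text{B occludes A} \\
 & \theta_j \in [0, \theta_v + \pi] & \text{A occludes B}
\end{array} \qquad (3.1)$$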

In Fig. 3.3, $\theta_v > 0$ and $\theta_v - \pi < \theta_j < 0$, so that A is occluding B. Occlusion
properties change at the boundaries of the intervals. Note that $\theta_j$ is bounded
away from zero on both sides by noninterpenetration.

As the viewpoint moves out of the joint plane, the amount of occluded
surface area decreases. When the general viewing vector, E, is parallel to
$n_j$ there is essentially no occlusion for all joint angles. E makes an angle
$\theta_n$ with the joint plane, in which it has the projection $E_v$. It follows that
any viewing direction can be represented in the joint coordinate frame by
two angles: $\theta_v$ and $\theta_n$. The occlusion conditions from Eqn. 3.1 apply only to
viewpoints for which $|\theta_n| < \Delta_n$, for some fixed threshold $\Delta_n$. For viewpoints
above this threshold, the links are disjoint.
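
The revolute occlusion test can be written as a short Python function. This sketch follows the interval conditions of Eqn. 3.1 as read here, with all angles in radians. The function name and the string return values are hypothetical, and treating joint angles outside the listed intervals as "disjoint" is an assumption of the example rather than a statement from the text.

```python
import math


def revolute_occlusion(theta_v, theta_j, theta_n, delta_n):
    """Occlusion between links A and B that share a revolute joint.

    theta_v: angle of the projected viewing direction in the joint plane
    theta_j: joint angle between the two links
    theta_n: angle between the viewing direction and the joint plane
    delta_n: threshold on |theta_n| beyond which the links are disjoint
    Returns "B occludes A", "A occludes B", or "disjoint".
    """
    if abs(theta_n) >= delta_n:
        return "disjoint"  # viewpoint is too far out of the joint plane
    if theta_v > 0:
        if 0 <= theta_j <= theta_v:
            return "B occludes A"
        if theta_v - math.pi <= theta_j <= 0:
            return "A occludes B"
    elif theta_v < 0:
        if theta_v <= theta_j <= 0:
            return "B occludes A"
        if 0 <= theta_j <= theta_v + math.pi:
            return "A occludes B"
    # Assumption: angles outside the listed intervals produce no occlusion.
    return "disjoint"
```

The configuration described for Fig. 3.3, with a positive viewing angle and a joint angle between $\theta_v - \pi$ and zero, gives "A occludes B" under this reading.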

Given the state of an articulated object, Eqn. 3.1 can be applied to determine
the occlusion at a revolute joint. To use this model for tracking, it
must be extended to include bounded motions of the two links. Bounded
change in the DOFs before link A in the kinematic chain will displace the
joint coordinate frame, causing $\theta_n$ and $\theta_v$ to vary. The exact change in these
angles will be a complex function of the state, but it can be approximated by
restricting them to intervals, $I_n$ and $I_v$, of a fixed size, centered around their
current value. Bounded motion between B and A is modeled by an interval
$I_j = [\theta_j^0 - \Delta_j, \theta_j^0 + \Delta_j]$, of half-width $\Delta_j$, containing $\theta_j$. The intervals $I_n$ and
$I_v$ are defined similarly. These intervals can be incorporated into Eqn. 3.1
by replacing inequalities with intersection tests. The normal, viewing, and
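joint angles at the current state are $\theta_n^0$, $\theta_v^0$, and $\theta_j^0$, respectively. The revolute
[...]

$$\begin{array}{lll}
\theta_v^0 > 0: & I_j \cap [0, \theta_v^0 + \Delta_v] \neq \emptyset & B \succ A \\
 & I_j \cap [\theta_v^0 - \Delta_v - \pi, 0] \neq \emptyset & A \succ B \\
\theta_v^0 < 0: & I_j \cap [\theta_v^0 - \Delta_v, 0] \neq \emptyset & B \succ A \\
 & I_j \cap [0, \theta_v^0 + \Delta_v + \pi] \neq \emptyset & A \succ B \\
\text{Otherwise} & & A = B
\end{array} \qquad (3.2)$$

3.2.3 Visibility Orders for Hand Templates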


The kinematic properties of objects like the hand can be exploited in an
algorithm for visibility-ordering link templates. In this approach, templates
are ordered within each finger chain using the revolute occlusion relation described
above. Then the chains are compared as distinct objects, avoiding
the complexity of testing each link against all the others. By exploiting the
kinematic structure, the algorithm is efficient enough for on-line implementation.
A more general approach to computing visibility orders from binary
occlusion relations is described in Sec. 3.4.^

The hand consists of five planar kinematic finger chains and a rigid palm.
As a result of planarity, the three joint axes in each finger are parallel and
have the same joint plane. This greatly simplifies the application of revolute
occlusion relations to finger ordering. A further simplification comes from the
fact that all joint angles must be positive, reflecting physical limits on joint
motion. As a result, each finger can be viewed as a convex planar shape.
These two observations lead to a simple procedure for ordering templates
within each finger.

^Note, however, that the revolute occlusion relation defined above applies only to pairs
of links that share a joint. This definition would have to be extended to an arbitrary pair
of links to meet the requirements of the general approach.

If the angle, $\theta_n$, between the camera and the finger joint plane exceeds
the threshold, $\Delta_n$, described in Sec. 3.2.2, then the finger templates are disjoint
and can be ordered arbitrarily. Otherwise, two applications of Eqn. 3.2
determine the ordering between links 1 and 2, and links 2 and 3 (see Fig. 2.5
for the link numbers, which are the same for each finger). Convexity imposes
strong constraints on the global pose of the finger, making it possible
to generate the entire visibility order directly from the two pairwise tests,
according to the following table:

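$$\begin{array}{lcl}
1 \succ 2 \ \text{or} \ 2 \succ 3 & \Rightarrow & 1 \succ 3 \\
2 \succ 1 \ \text{or} \ 3 \succ 2 & \Rightarrow & 3 \succ 1 \\
1 = 2 \ \text{and} \ 2 = 3 & \Rightarrow & 1 = 3
\end{array} \qquad (3.3)$$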

The thumb is also a planar mechanism with joint limits, and the finger template
ordering rules can be applied to it without modification.
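
The within-finger ordering rule can be sketched in Python as follows. The encoding of the pairwise results as the characters '>', '<' and '=' and the function name are hypothetical. Reading the table as producing the full orders 1, 2, 3 or 3, 2, 1 is an interpretation, and the sketch assumes that convexity rules out contradictory pairwise results, raising an error if they occur.

```python
def finger_order(rel12, rel23):
    """Visibility order of links 1, 2, 3 from the two pairwise tests.

    rel12, rel23: '>' if the lower-numbered link will-occlude the other,
                  '<' for the reverse, '=' if the pair is disjoint.
    """
    forward = rel12 == '>' or rel23 == '>'   # implies link 1 will-occlude link 3
    backward = rel12 == '<' or rel23 == '<'  # implies link 3 will-occlude link 1
    if forward and backward:
        # Assumed not to occur for a convex finger.
        raise ValueError("contradictory pairwise relations")
    if backward:
        return [3, 2, 1]
    # Either link 1 comes first, or all links are disjoint and any order is valid.
    return [1, 2, 3]
```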

Occlusions between fingers are almost always binary. This observation
simplifies visibility ordering by removing the need to consider individual templates.
When the planes for two fingers are parallel, they can be ordered by
distance from the camera. When the planes intersect, there are three possibilities,
illustrated in Fig. 3.4 (a), (b), and (c). In (a), neither chain crosses
the dividing line formed by the plane intersection. In this case, the two planes
divide 3D space into four quadrants, with associated visibility orders given in
the figure. In (b), one chain crosses the dividing line, but the other does not.
In this case the quadrant labels are different. Note that the transition from
(a) to (b) either leaves the visibility order unchanged, or changes a disjoint
situation to an ordered one.

The only case where nontrivial interaction between the chains occurs is
(c), where they both cross the dividing line. This case requires additional
analysis within the plane of the finger. The first step is to choose the plane
closest to the camera, in which the occlusion effect is most visible, and project
the camera viewpoint into that plane. Due to convexity, the other chain will
intersect this plane at one point. Figure 3.4 (d) shows a sample configuration
of links in this case. If the line in the plane joining the projected viewpoint
and the intersection point passes through the chain, then the chain comes
first in the visibility order. Otherwise, the intersecting chain comes first.

Figure 3.4: Types of intersections between two planar kinematic chains. In
(a), chains are confined to separate sides of the dividing line at which their
planes intersect. In (b) one chain crosses the line, and in (c) they both do.
The viewpoint relative to the dividing line determines the visibility order.
(d) shows the ordering test from (c) in the chain 2 plane.


In order for the plane intersection test to be valid under bounded motion,
it is necessary to model the effect of chain motion within the plane and motion
of the plane itself on the outcome. If the test is applied between fingers
and thumb on the same hand, then palm motion will not affect the type
of intersection, but may change the camera's quadrant. Since the decision
hinges on whether each chain crosses the dividing line, this can be modeled
by bounding the distance to the line for the closest part of each chain. The
only nontrivial transition is from case (b) to (c). In this situation, a finger
or thumb tip intersects the other chain's plane for the first time. The point
of intersection can be predicted from the motion, or bounded by intersecting
the bound on tip displacement with the plane.

Finger planes will intersect each other due to abduction. However, these
planes are roughly parallel, and the intersections will almost always be of type
(a) in Fig. 3.4. As a result, there is a simple visibility ordering algorithm
for the fingers: sort the anchor points for each finger based on distance to
the camera along the optical axis. This determines the finger ordering. The
thumb plane can intersect the finger planes in a variety of ways depending
on the motion, and the intersection tests described above must be applied
in this case. In most situations, the outcome of the test between the thumb
and first finger can be applied to the rest of the fingers as well.
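The finger ordering rule can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: it assumes anchor points are given in camera coordinates with z measured along the optical axis, and the function name and sample coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

# (x, y, z) in camera coordinates; z is assumed to lie along the optical axis.
Point3 = Tuple[float, float, float]


def order_fingers_by_depth(anchors: Dict[str, Point3]) -> List[str]:
    """Sort finger names by anchor depth, nearest to the camera first."""
    return sorted(anchors, key=lambda name: anchors[name][2])


if __name__ == "__main__":
    # Hypothetical anchor positions in meters.
    anchors = {
        "ring": (-0.02, 0.0, 0.53),
        "index": (0.02, 0.0, 0.48),
        "middle": (0.00, 0.0, 0.50),
    }
    print(order_fingers_by_depth(anchors))
```

The thumb is deliberately left out of this sort, since its plane requires the intersection tests described above.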

Finally, the plane of the palm sweeps out a volume in space in the direction
of the camera axis. If the tip of a finger or the thumb intersects
this volume, then the palm comes before that chain in the visibility order,
otherwise after. A visibility order for hand templates can be constructed
from the tests described above. These tests are simple to implement, making
it possible to update the ordering on-line whenever a new state estimate is
available. Note that the fundamental assumption in the above analysis is
the planarity of the kinematic chains comprising the object. This modeling
assumption is also valid for arms and legs, suggesting that the ordering tests
described above could also be applied to human figures.

3.3  Estimation  with  Layered  Templates 

Using the techniques from the previous section, hand templates can be maintained
in visibility order during tracking. This section describes an algorithm
for registering an ordered set of overlapping templates to an input image.
Tracking is achieved by applying this algorithm to each frame in a motion
sequence, using the estimated state from the previous frame as the starting
point for registration. Window functions are the key to registration. They
model the appearance and disappearance of template pixels as a result of
the image plane motion of overlapping templates. The resulting gradient-based
minimization problem requires the derivation of Jacobians for layered
templates, and algorithms for image segmentation. These components are

Figure 3.5: Image composition example for two 1D templates. Occlusion is
modeled by the unit window function shown on the right.

two templates are combined to give a composite image:

$$I_c(x) = M_1(x - x_1)\, I_1(x - x_1) + [1 - M_1(x - x_1)]\, I_2(x - x_2) \tag{3.4}$$

where $I_{1,2}(\cdot)$ are the templates and $M_1$ is the window function for template
1. Given $m(\cdot, L)$, a unit window of length $L$ for 1D images, it follows that
$M_1(\cdot) = m(\cdot, L_1)$.
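The 1D composition of Eqn. 3.4 can be sketched as follows. This is an illustrative sketch under stated assumptions: templates are plain lists sampled on integer pixels, the unit window is taken to be one on [0, L) and zero elsewhere, and samples outside a template are treated as zero. The helper names are hypothetical.

```python
def unit_window(x, length):
    """m(x, L): 1 inside the template extent [0, L), 0 elsewhere (assumed convention)."""
    return 1.0 if 0 <= x < length else 0.0


def sample(template, x):
    """Template value at integer coordinate x; zero outside the template (assumption)."""
    return template[x] if 0 <= x < len(template) else 0.0


def composite_1d(t1, x1, t2, x2, width):
    """Composite image: template 1 (at offset x1) occludes template 2 (at offset x2)."""
    image = []
    for x in range(width):
        m1 = unit_window(x - x1, len(t1))
        image.append(m1 * sample(t1, x - x1) + (1.0 - m1) * sample(t2, x - x2))
    return image


if __name__ == "__main__":
    # Template 1 covers pixels 2..4 and hides the first pixel of template 2.
    print(composite_1d([5, 5, 5], 2, [1, 1, 1, 1], 4, 8))
```

Moving `x1` changes which pixels of template 2 are covered, which is the effect the window function is meant to capture.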

$I_c(\cdot)$ represents the forward model of the image as a function of the state.
The 2D version of this function is formed by combining the deformable template
model of Sec. 2.4.1 with the layered occlusion representation described
above. A 2D version of Eqn. 3.4 can be written

$$I_c(q, w) = M_1(q, w)\, I_1(f_1^{-1}(q, w)) + [1 - M_1(q, w)]\, I_2(f_2^{-1}(q, w)) , \tag{3.5}$$

where $f_{1,2}^{-1}$ are inverse deformation functions for the two templates that map
from image coordinates to template coordinates as a function of the state.
Since the functions $f_{1,2}$ are affine in the image coordinates, their inverses are
well-defined. $M_1(q, w)$ denotes the window function for template 1, positioned
in the image. It is defined for a general template, $I_j$, as

$$M_j(q, w) = m_j(f_j^{-1}(q, w)) , \tag{3.6}$$

where $m_j(s)$ is a 2D unit window in template coordinates, that is equal to
one inside the template's boundary contour and zero everywhere else, as
illustrated in Fig. 3.6.

The incorporation of the composite image in an SSD residual can be
illustrated in the more complicated case of adding a background template,
$I_b$, to Eqn. 3.5, obtaining

$$
\begin{aligned}
E(q) &= \frac{1}{2} \int [I(w) - I_c(q, w)]^2 \, dw \\
     &= \frac{1}{2} \int \big[ I(w) - M_1(q, w)\, I_1(f_1^{-1}(q, w)) - [1 - M_1(q, w)] \times \\
     &\qquad \{ M_2(q, w)\, I_2(f_2^{-1}(q, w)) + [1 - M_2(q, w)]\, I_b(w) \} \big]^2 \, dw ,
\end{aligned}
\tag{3.7}
$$

where $M_{1,2}$ are window functions for the two filtered templates, $I_{1,2}$, and $I_b$
is the filtered background.

Figure 3.6: A template and its associated unit window function are illustrated
for the finger tip.
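The nesting of window functions in Eqn. 3.7 extends naturally to any number of ordered templates. The sketch below is an illustration under assumptions rather than the thesis code: each layer is a hypothetical pair of callables, `window(w)` and `template(w)`, that already include the inverse deformation for the current state, the integral is replaced by a sum over pixels, and the factor of one half is assumed as a convention.

```python
def composite_pixel(layers, background, w):
    """Composite value at pixel w for layers given in visibility order (front first)."""
    value = background(w)
    # Work from the least visible layer forward so that nearer layers overwrite farther ones.
    for window, template in reversed(layers):
        m = window(w)
        value = m * template(w) + (1.0 - m) * value
    return value


def ssd_residual(image, layers, background, pixels):
    """Half the sum of squared differences between the image and the composite."""
    return 0.5 * sum(
        (image(w) - composite_pixel(layers, background, w)) ** 2 for w in pixels
    )


if __name__ == "__main__":
    # Two 1D layers over a constant background (all values hypothetical).
    front = (lambda w: 1.0 if 2 <= w < 5 else 0.0, lambda w: 5.0)
    back = (lambda w: 1.0 if 4 <= w < 8 else 0.0, lambda w: 1.0)
    layers = [front, back]
    background = lambda w: 9.0
    print([composite_pixel(layers, background, w) for w in range(10)])
    print(ssd_residual(lambda w: 9.0, layers, background, range(10)))
```

With two layers the loop reproduces the bracketed expression in Eqn. 3.7 term by term.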


A set of templates in visibility order results in a recursiv

Figure 3.7: Tree of window functions generated by a set of templates,
$I_1, I_2, \ldots, I_n$, in visibility order. $I_b$ is the background template.


3.3.2  Minimization  of  Layered  Template  Error 

The  gradient  descent  tracking  algorithm  described  in  Sec.  2.4.3  can  be  applied 
to  error  functions  such  as  Eqn.  3.7,  registering  the  layered  templates  to  an 
input  image.  Minimizing  the  error  generates  a  segmentation  of  the  image. 



Two algorithms for obtaining the image segmentation in step 1 are described
in Sec. 3.3.4. Once an image pixel has been assigned to a template, its corresponding
template pixel is determined by the inverse deformation function,
and the residual follows easily. The remaining step is the computation of the
residual Jacobian.

3.3.3  Residual  Jacobian  Computation 

Suppose an image pixel, $w'$, originates from template $I_j$ at template coordinate
$s_j$. Furthermore, let $p'$ denote the 3D position of $s_j$ in camera
coordinates, as determined by the position of the template plane. The pixel
at $w'$ makes the following contribution to the residual

$$R' = I(w') - I_c(q, w') . \tag{3.8}$$

There are two possible cases for the pixel $w'$: either it is in the interior
of $I_j$, or it is on the boundary of $I_j$ and a second template $I_k$, where $j < k$.
The Jacobian calculations in these cases rely on two assumptions: that the
window functions are constant in the template interiors and fall to zero at
their boundaries, and that the bodies are opaque, so that no more than two
templates can affect a pixel value simultaneously.

For the interior pixel case, the ordered templates can be divided into a
group, $\{I_1, \ldots, I_{j-1}\}$, that occludes $I_j$ and a group, $\{I_{j+1}, \ldots, I_n\}$, that is
occluded by it, as illustrated in Fig. 3.8 (a). Window functions and their
gradients for the occluding templates are zero at $w'$, leading to the simplification

$$
\begin{aligned}
R' = I(w') &- [1 - M_1(q, w')][1 - M_2(q, w')] \cdots [1 - M_{j-1}(q, w')] \times \\
           &M_j(q, w')\, I_j(f_j^{-1}(q, w')) - \cdots .
\end{aligned}
\tag{3.9}
$$

The second term in Eqn. 3.9 is produced by descending the window tree to
node $I_j$. The gradients of its window functions are zero at $w'$, so its only

of $m_k$ at $s_k$. This implies that the templates between $j$ and $k$ in the tree
vanish at $w'$ along with their derivatives. Furthermore, as in the interior
case, the templates occluding $I_j$ and occluded by $I_k$ make no contribution to
the Jacobian. This results in the simplified residual,

$$
\begin{aligned}
R' = I(w') &- [1 - M_1(q, w')] \cdots M_j(q, w')\, I_j(f_j^{-1}(q, w')) \\
           &- [1 - M_1(q, w')] \cdots [1 - M_j(q, w')] \cdots M_k(q, w')\, I_k(f_k^{-1}(q, w')) .
\end{aligned}
\tag{3.11}
$$

Since this is in the form of Eqn. 3.9, it has an interior Jacobian component
as before. Window function gradients from the last two terms yield an additional
component. In these terms, only $M_j(q, w')$ has a nonzero derivative
at $w'$. Substituting Eqn. 3.6 for $M_j$ and differentiating yields the boundary
Jacobian

$$J_j^{b}(s') = [I_k(s_k) - I_j(s_j)]\, \frac{\partial m_j}{\partial q} . \tag{3.12}$$

This  boundary  component  captures  the  effect  of  occlusion  in  covering  and 
revealing  pixels  as  the  state  changes. 
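The origin of the intensity difference in the boundary term can be sketched with a single differentiation step. The reduced residual below is an assumption made for illustration: every window factor other than $M_j$ is treated as constant at $w'$, the templates occluding $I_j$ contribute a factor of one, and $M_k = 1$.

```latex
\begin{align*}
R' &= I - M_j I_j - (1 - M_j)\, I_k \\
\left.\frac{\partial R'}{\partial q}\right|_{\text{window}}
   &= -\frac{\partial M_j}{\partial q}\, I_j + \frac{\partial M_j}{\partial q}\, I_k
    = (I_k - I_j)\, \frac{\partial M_j}{\partial q}
\end{align*}
```

Under these assumptions the term vanishes when the occluding and occluded templates have the same intensity at the boundary, since covering or revealing a pixel then leaves the image unchanged.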

The  above  discussion  shows  that  the  residual  Jacobian  for  a  template 
has  two  basic  types  of  components:  region  contributions  from  Eqn.  3.10  and 
boundary  contributions  from  Eqn.  3.12.  This  suggests  a  simple  algorithm  for 
Jacobian  computation: 

1.  Scan  the  segmented  image  and  compute  the  region  contribution  to  the 
Jacobian  at  each  visible  pixel,  using  Eqn.  3.10. 

2.  Scan  the  discretized  boundary  of  each  template.  If  a  boundary  point  is 
visible,  identify  the  template  it  is  occluding  and  compute  the  boundary 
Jacobian  term  from  Eqn.  3.12. 

The above algorithm, along with the segmentation algorithm described in
the next section, forms the basis for gradient-based local tracking.

3.3.4 Algorithms for Image Segmentation

Each pixel in the input image must be assigned to a template in order to compute
its contribution to the gradient. This segmentation problem is closely
related to the visible surface determination problem in computer graphics:
Given a set of polygons in camera coordinates, identify and scan-convert^ the
parts that are visible. Through this analogy, segmentation algorithms can
be divided into two classes: list-priority and scan-line (see [18], Sec. 15.11).

Templates are scanned sequentially in visibility order in the list-priority
approach, and the most visible template is converted last. Each template is
scan-converted independently, and its pixels in the input image are labeled.
The visibility ordering ensures that each pixel is correctly labeled at the end
of this first stage. The labeled pixels are then rescanned in a second stage
to compute the Jacobian, as discussed in the previous section. Pixels contained
by overlapping templates are processed multiple times, but template
conversion and pixel labeling are simple and fast. This is the segmentation
algorithm used in the experiments of Chpt. 4. Because of the visibility order,
this approach is superior to the standard computer graphics depth sorting
algorithm, which often splits polygons that can be correctly ordered ([18],
Fig. 15.27).
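The first (labeling) stage of the list-priority approach can be sketched as below. This is a simplified illustration, not the implementation used in the experiments: a hypothetical `covers(x, y)` predicate stands in for scan conversion of each template, and every pixel is tested against every template.

```python
def label_pixels(templates, width, height):
    """Label each pixel with the most visible template that covers it.

    templates: list of (label, covers) pairs in visibility order, most visible
    first, where covers(x, y) returns True if the template contains the pixel.
    """
    labels = [[None] * width for _ in range(height)]
    # Convert the least visible template first, so the most visible is written last.
    for label, covers in reversed(templates):
        for y in range(height):
            for x in range(width):
                if covers(x, y):
                    labels[y][x] = label
    return labels


if __name__ == "__main__":
    # Two overlapping rectangles; "f1" is in front of "f2" (hypothetical shapes).
    templates = [
        ("f1", lambda x, y: 1 <= x <= 3),
        ("f2", lambda x, y: 2 <= x <= 4),
    ]
    for row in label_pixels(templates, 6, 2):
        print(row)
```

Pixels in the overlap are written twice, which is the redundancy the text mentions; the final label is still correct because of the ordering.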

In contrast, scan-line algorithms sort the template edges on x and y,
and scan the image one line at a time. When templates overlap, the visibility
order determines the pixel assignment, and coherence is used to avoid
unnecessary comparisons. The binary occlusion assumption plays the same
role for coherence as polygon nonpenetration in the graphics case. Scan-line
algorithms are more efficient than list-priority algorithms: Each pixel is processed
once, avoiding redundant calculations. The Jacobian can be computed
in one pass, avoiding a labeling stage. They are, however, more complicated
to implement.

^In scan-conversion, polygons (specified by a set of vertices) are mapped into their
component pixels in the frame buffer.

The scan-line approach can also be used in situations where a visibility
order is not available. In this case, the depth at each pixel in the scan-line
determines the template order. In this approach, template order is computed
in conjunction with segmentation. However, preservation of the ordering
under bounded motion is not guaranteed, and this approach may require a
prohibitively high sampling rate to work in practice. Moreover, in the case
where the template ordering is fixed for a number of estimation steps, this
version of the scan-line algorithm is inefficient, as it recomputes the visibility
order each time.


3.4  The  Existence  of  Visibility  Orders 

The existence of a visibility ordering algorithm for the hand raises the question
of what other objects can be treated under the same framework. This
section develops general existence conditions for visibility orders. These results
apply to a multibody system with arbitrary degrees of freedom.

3.4.1  Existence  Conditions  for  Occlusion  Relations 

A multi-body system has a local occlusion invariant if, for a given bounded
motion, one of A = B, $A \succ B$, or $B \succ A$ is true for each pair of bodies, A
and B. A visibility order can be constructed in this case, as shown in the
next section. Bounded relative motion between the two bodies is modeled
by $M(B)$, the union of all possible spatial positions of B with respect to A's
coordinate frame.^ In general, $M(B)$ will not be convex. But its convex hull,
$CH[M(B)]$, can be partitioned from A by a separating plane if the occlusion
is unambiguous. This is illustrated in Fig. 3.9 (a) for two 2D bodies viewed
by a 1D camera. The relative motion in this case is rotation of B. The
partition creates two half-spaces. If the image plane projections of A and the
motion image of B, $CH[M(B)]$, don't overlap, A = B. If they do overlap, the
object in the half-space containing the camera will occlude the other object.
In figure (a), $B \succ A$.

^The spatial position of each body is defined with respect to the world coordinate
frame. Above, the reference frame is shifted to A for convenience.

Figure 3.9: Occlusion relations for 2D objects viewed by a 1D camera. (a)
Sufficient conditions for an occlusion relation between A and B, (b) geometric
definition of occlusion ambiguity, and (c) degenerate configuration of two
planar objects in point contact. No nonzero bound on relative translation can
remove the occlusion ambiguity.

The case of occlusion ambiguity is illustrated in Fig. 3.9 (b), using the
same two bodies. For this configuration, it is impossible to predict the occluder
under the given motion bound. Ambiguity arises when $CH[M(B)]$
intersects the occluding limb of A. Referring to the figure, let EV denote
the pair of line-of-sight tangents to A, with E^ closest to B. The points
of contact, are the occluding limbs (in 3D this is a curve in the surface
of A). The pair of tangents bound a region of space, $O_A$ (a tangent
cone in 3D), which contains A and the camera viewpoint. $O_A$ is divided
into occluding and occluded regions, labeled O'^ and O^. Occlusion ambiguity
arises when $M(B)$ has a nonzero intersection with both regions. In this
case, $CH[M(B)]$ intersects A and contains and both binary occlusion
outcomes are possible. In general, the likelihood of an occlusion ambiguity
decreases with the motion bound, but it can't be eliminated altogether, as
figure (c) demonstrates.

When they exist, the set of ambiguous configurations will occupy a small
subspace of the total configuration space, as they depend on a special combination
of spatial proximity and viewing angle. An example of an ambiguous
hand configuration is the "stop" gesture, with the hand held flat, fingers
pressed together, and palm facing the camera. In this pose, rotation around
the vertical axis changes the visibility order of the fingers. In a specific case
like the hand, knowledge about ambiguous configurations can be used to aid
tracking. Simple velocity-based prediction, for example, could be used to
correctly interpret ambiguous cases. In general, high frame rates reduce the
danger of an incorrect occlusion hypothesis, by making the mislabeled region
of pixels as small as possible.

3.4.2  Visibility  Ordering  and  Occlusion  Graphs 

The occlusion relations for a multi-body system can be represented by a
directed occlusion graph. The graph is a pair (V, A), where the vertex set V
contains all of the bodies. To construct the edge set, A, consider all pairs
$x, y \in V$. Since there are no occlusion ambiguities, one of x = y, $x \succ y$, or
$y \succ x$ must be true. In the first case no edge is added, while the other two
cases add directed edges (x, y) and (y, x) respectively. Consider the collection
of 2D rigid bodies viewed by a 1D camera illustrated in Fig. 3.10. Figure 3.11
(a) shows the occlusion graph for the system under bounded translations in
the plane.

Figure 3.10: A collection of 2D rigid bodies under bounded translational
motion relative to a 1D camera. Each body can translate by $\Delta X$ and $\Delta Y$,
as shown for body E.

Figure 3.11: (a) Occlusion graph for the mechanism in Fig. 3.10, and (b) the
visibility order produced by sorting the graph.

When the object configuration admits a visibility ordering, it can be
obtained by searching the occlusion graph. A configuration that can't be
so ordered is illustrated in Fig. 3.12. In general, the occlusion graph must
be acyclic to induce a natural order on the set of objects. The presence of
occlusion cycles is fairly unusual, at least for convex bodies, as it involves a
special arrangement of spacing and orientation. Cycles don't occur naturally
in hand or body configurations, for example.

Figure 3.12: (a) A configuration of three objects and (b) its associated cyclic
occlusion graph.

When the occlusion graph is acyclic, it can be topologically sorted by
depth-first search [10] to produce a visibility ordering. Figure 3.11(b) shows
the ordering produced by the sample occlusion graph. The sorted graph has
the property that all edges are directed left to right. Taking the vertices
in that order guarantees that no object will be occluded by an object that
follows it in the list.
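The sorting step can be sketched with a standard depth-first topological sort. This is a generic sketch, not code from the thesis: the graph is assumed to be given as a dictionary mapping each body to the bodies it occludes, and an occlusion cycle is reported by raising an exception. The function name is hypothetical.

```python
def visibility_order(bodies, occludes):
    """Topologically sort an occlusion graph by depth-first search.

    bodies: list of body names.
    occludes: dict mapping a body to the bodies it occludes (edge x -> y).
    Returns a list in which every edge points from an earlier to a later entry.
    Raises ValueError if the graph contains an occlusion cycle.
    """
    bodies = list(bodies)
    unvisited, in_progress, done = 0, 1, 2
    state = {b: unvisited for b in bodies}
    finished = []

    def visit(b):
        state[b] = in_progress
        for c in occludes.get(b, ()):
            if state[c] == in_progress:
                raise ValueError("occlusion cycle: no visibility order exists")
            if state[c] == unvisited:
                visit(c)
        state[b] = done
        finished.append(b)

    for b in bodies:
        if state[b] == unvisited:
            visit(b)
    # Reverse finishing order gives the topological order.
    return finished[::-1]


if __name__ == "__main__":
    print(visibility_order(["C", "B", "A"], {"A": ["B", "C"], "B": ["C"]}))
```

A cyclic input such as the three-object configuration of Fig. 3.12 would raise the `ValueError` instead of returning an order.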

These results give sufficient conditions for the existence of a visibility
ordering for an arbitrary object. Existence hinges primarily on the absence
of occlusion ambiguities, which is determined by the relative motion and
the temporal sampling rate. These results are useful in identifying the most
likely configurations for occlusion ambiguities in a known object.

Looking beyond model-based tracking, there is increasing interest in layered
representations for computer vision, because of their potential to simplify
the 3D description of the world. Recently, several algorithms have been
proposed for building layered descriptions of a scene from a single image or
a motion sequence [41, 11, 67]. The results in this section provide general
conditions under which a layered representation could be expected to exist,
for a given type of moving object.

3.4.3  Occlusion  Events  and  Global  Models 

The occlusion graph for an object is a function of its state. A discrete
change in the topology of the graph can be viewed as an occlusion event,
analogous to the visual events introduced by Koenderink and van Doorn [31].
These events partition the configuration (state) space into hypervolumes over
which the occlusion graph is constant. The state space partition is called the
occlusion meta-graph for the object. The state partition in Fig. 3.2 can now
be recognized as the meta-graph for the two finger model. Moreover, the
visibility ordering rules for the hand described in Sec. 3.2.3 are, in fact,
testing for occlusion events. These tests can be done efficiently in Cartesian
space, in spite of the large number of DOFs, by exploiting the kinematic
model.

The construction of an occlusion meta-graph for a two link planar mechanism
is illustrated in Fig. 3.13. The 2 DOF state space is partitioned into
three types of regions for which the occlusion graph is constant. The values
of $\theta_1$ can wrap around from $-\pi$ to $\pi$, but $\theta_2$ is bounded away from both
extremes, due to noninterpenetration. Each state is restricted to an interval
of size $2\Delta\theta$ between frames. The construction technique is analogous to
obstacle growing in the configuration space approach to manipulator path
planning [8]. In general, the hypervolumes will be n dimensional regions
bounded by curved surfaces.

The two link meta-graph shares with Fig. 3.2 the property that trajec-

3.5 Discussion

The  self-occlusion  of  articulated  objects  can  be  modeled  by  a  layered  template 
representation,  which  is  updated  over  time  by  means  of  a  kinematic  model. 
Layered  representations  are  constructed  from  visibility  ordered  templates, 
and  practical  ordering  algorithms  can  be  obtained  from  the  object  kinematics. 

Window  functions  mask  templates  on  the  basis  of  the  visibility  order, 
leading  to  a  direct  minimization-based  solution  to  self-occluding  motion.  By 
analyzing  the  structure  of  the  window  functions  in  the  objective  function,  a 
simple  algorithm  for  Jacobian  computation  is  obtained. 

The existence properties of the occlusion representation depend on the
lack of occlusion ambiguities between pairs of rigid links. These existence
results establish the applicability of the tracking framework to arbitrary articulated
objects.

Chapter 4

Hand  Tracking  Experiments 


The  kinematic  models  and  tracking  algorithms  presented  in  Chpt.  2  were  used 
to  construct  a  real-time  articulated  object  tracking  system,  called  DigitEyes. 
Two  hand  tracking  experiments  using  the  DigitEyes  system  are  reported 
in  this  chapter,  along  with  an  off-line  experiment  in  tracking  self-occluding 
motion.  These  experiments  validate  the  model-based  tracking  framework 
presented  above,  and  demonstrate  the  potential  of  3D  human  sensing,  at 
frame  rates  of  up  to  10  Hz,  using  currently  available  computer  hardware. 
All  of  these  results  are  the  first  of  their  kind,  demonstrating  real-time  high 
DOF  tracking  of  hands  using  natural  imagery,  and  with  nontrivial  amounts 
of  self-occlusion. 

The chapter begins with a description of the experimental objectives of
the DigitEyes implementation, followed by a detailed discussion of its software
architecture. This architecture made it possible to construct on-line
and off-line versions of the system from the same basic set of modules. Next,
the computational cost of hand tracking in DigitEyes is analyzed, and the
special hardware used to achieve real-time performance is discussed. Following
this, the first real-time experiment, tracking a 27 DOF hand model
with two cameras, is presented. This result constitutes the first experimental
demonstration of 3D high DOF tracking of unmarked, unadorned hands. In
the second experiment, a simple 3D cursor user-interface was developed and
tested using the DigitEyes system. Finally, experimental results are given
for off-line tracking of two fingers in the presence of self-occlusions.

4.1  Experimental  Objectives 

The tracking experiments in this chapter were designed with two purposes
in mind. The first was to validate the model-based tracking framework described
in Chpts. 2 and 3 on real hand images. The DigitEyes real-time
tracking system was indispensable in this task, as it made it possible to conduct
experiments with millions of images in a reasonable amount of time.^
Real-time hand tracking with one and two cameras provided experimental
validation of the kinematic models and estimation framework, and a separate
off-line tracking experiment tested the additional representations for
self-occlusion. The second experimental goal was to evaluate the potential
usefulness of vision-based hand tracking in applications. This was accomplished
by applying the DigitEyes system to the 3D cursor user-interface
problem, described in Sec. 4.3.4.

The two types of errors that are important in tracking are residual errors
and state errors. Residual errors measure the difference between the input
image and the image predicted by the state estimate, acting through the
model. The residuals are defined mathematically by Eqns. 2.10, 2.17, and
2.19, and the state estimate minimizes them by definition. Backprojecting
the estimated hand pose onto its associated image makes it possible to visually
assess the degree of fit between the estimate and the measurement. A
qualitative visual agreement between the back-projected model and the image
is the most basic requirement for tracking performance, and is the basis
for experimental validation in this thesis.

^The DigitEyes system was in daily operation for over a year. Assuming that the system
ran for an hour each weekday at a sampling rate of 10 Hz, it follows that approximately
10 million images were processed!

State error, on the other hand, is the difference between the tracker output
and the ground truth for the physical system being tracked. It is synonymous
with the accuracy of the tracker. Determining ground truth motion for a
complicated object like the hand is extremely difficult, as the lack of a good
noninvasive sensor is one of the motivations of this work. Although state
variables such as joint angles provide a compact description of hand motion,
obtaining ground truth for them is probably impractical. The most promising
ground truth measure, discussed in more detail in Sec. 6, is to attach LEDs
to the hand in a way that doesn't interfere with DigitEyes, and measure their
absolute spatial position using stereo.

Track life [34] is a dynamic property of the estimator closely linked to
the residual error. It refers to the length of time (number of frames) that
the tracker remains on target, as measured by its ability to extract useful
measurements from each image. All of the tracking algorithms in this thesis
use the projected kinematic model to segment the input image into features
or templates. Track loss occurs if the residual error grows so large that the
model no longer projects to the correct parts of the image. When track loss
occurs, the estimator loses correspondence with the image and the state error
can grow arbitrarily large. However, the residual error in each frame may be
small enough to prevent track loss, and yet the state error may remain large
due to model error or singularities. Thus track life, like the residual error, is
a weaker criterion than tracking accuracy.

4.2  Software  Architecture 

The DigitEyes system was designed with a modular software architecture
that makes it possible to quickly assemble individualized tracking systems
for both videotaped and real-time imagery, using both templates and point
and line features. The system runs on Sun and SGI workstations, as well
as on a special board for real-time image processing, called IC40. All of the
software is written in C. The major components of the software architecture
are shown in Fig. 4.1. The interface and display modules were written for an
SGI Indigo 2 workstation, using GL and the FORMS user-interface toolkit.
The solver and image processing components will run on all three hardware
platforms.

Figure 4.1: Software architecture for tracking system.

The kinematics module is the heart of the system. Its primary data
structure is a tree of link frames connected by kinematic transforms. The tree
structure captures the topology of the kinematic model, and represents the
transformations between the links, along with their kinematic parameters,
features, and shape models. The tree is constructed automatically from
an initialization file, such as the one in Appendix A, and is made up of
base and chain nodes. Base nodes, such as the palm frame in the hand
model, have a spatial transform and multiple children. Chain nodes have
Denavit-Hartenberg transforms and a single child. Arbitrary branched, open
kinematic chains can be constructed from these two elements. Each node also
contains a set of kinematic parameters, divided into state variables and fixed
parameters. In a static node, all of the parameters are fixed. Active base
nodes have seven state variables and active chain nodes have one. Each
node may contain geometry, in which case it has both feature points and a
polygonal solid model defined with respect to a shape frame.
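
The node layout described above can be illustrated with a minimal sketch in
C. The type and field names (LinkNode, NodeKind, MAX_CHILDREN) are
hypothetical and are not taken from the DigitEyes source; only the rules that
an active base node carries seven state variables, an active chain node one,
and a static node none come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds, named after the base and chain nodes in the text. */
typedef enum { NODE_BASE, NODE_CHAIN } NodeKind;

#define MAX_CHILDREN 8   /* illustrative limit, not from the source */

typedef struct LinkNode {
    NodeKind kind;
    int active;                               /* 0: static node, all parameters fixed */
    size_t num_children;                      /* chain nodes have at most one child */
    struct LinkNode *children[MAX_CHILDREN];
} LinkNode;

/* State variables contributed by one node: seven for an active base node,
   one for an active chain node, none for a static node. */
static size_t node_state_count(const LinkNode *node)
{
    if (!node->active)
        return 0;
    return node->kind == NODE_BASE ? 7 : 1;
}

/* Descend the tree recursively and total the state variables. */
size_t tree_state_count(const LinkNode *node)
{
    size_t total, i;
    assert(node != NULL);
    assert(node->num_children <= MAX_CHILDREN);
    assert(node->kind == NODE_BASE || node->num_children <= 1);
    total = node_state_count(node);
    for (i = 0; i < node->num_children; ++i)
        total += tree_state_count(node->children[i]);
    return total;
}
```

Calling tree_state_count on the root of a model gives the length of its state
vector, which is also the number of columns the estimator's Jacobian needs.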

Functions in the kinematics module descend the link tree recursively,
updating the spatial position of the link and shape frames, and computing
Jacobians with respect to the active variables. The output of this positioning
operation is used in two ways. First, the Jacobian matrix used for estimation
is built from columns distributed through the link tree. It combines with the
feature residuals to form a linear system. Second, for display purposes, the
positioned shape models can be rendered on an SGI workstation from a
user-controlled viewpoint.
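
The recursive positioning pass can be sketched as follows. This is a toy
version under stated assumptions: it uses 3x3 planar homogeneous transforms
supplied directly by the caller, whereas the real system evaluates 3D spatial
and Denavit-Hartenberg transforms from the state and also accumulates
Jacobian columns, which are omitted here. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4   /* illustrative limit, not from the source */

/* 3x3 homogeneous transform for a planar toy model (assumption: the real
   system would use 3D spatial and Denavit-Hartenberg transforms). */
typedef struct { double m[3][3]; } Transform;

typedef struct Link {
    Transform local;                  /* this frame expressed in its parent frame */
    Transform world;                  /* filled in by update_frames() */
    size_t num_children;
    struct Link *children[MAX_CHILDREN];
} Link;

/* Matrix product: returns a * b. */
static Transform compose(const Transform *a, const Transform *b)
{
    Transform out;
    int r, c, k;
    for (r = 0; r < 3; ++r)
        for (c = 0; c < 3; ++c) {
            out.m[r][c] = 0.0;
            for (k = 0; k < 3; ++k)
                out.m[r][c] += a->m[r][k] * b->m[k][c];
        }
    return out;
}

/* Descend the tree, positioning every link frame in the frame that
   parent_world is expressed in (for example, the camera frame). */
void update_frames(Link *link, const Transform *parent_world)
{
    size_t i;
    assert(link != NULL && parent_world != NULL);
    assert(link->num_children <= MAX_CHILDREN);
    link->world = compose(parent_world, &link->local);
    for (i = 0; i < link->num_children; ++i)
        update_frames(link->children[i], &link->world);
}
```

Each child is positioned from its parent's already-updated world transform,
so a single descent from the root positions the whole tree.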

There are two types of feature modules that interface with the same
kinematics module. For tracking experiments with point and line features,
a single 3D feature point in each frame combines with the point and line
residuals. The Jacobian matrix is constructed from the contributions of each
of these points. For experiments with templates, points sampled from the
template plane form the Jacobian that is used during gradient descent. The
same basic software for computing point Jacobians is used in both cases.

The interface gives the user control over the display of the estimated
model. Since each shape frame is positioned using the estimated state,
rendering these shapes on a graphics workstation gives visual feedback of the
estimator’s performance. Models can be rendered from the same viewpoint
as the calibrated camera, or from a viewpoint specified interactively by the
user. The 3D cues provided by the shaded model, rendered from the
calibrated camera viewpoint, make it possible to visually gauge registration
and state errors simultaneously. Additional control over the solver and image
processing is available in the interactive version of the system. The
interface to the interactive system forms the basis of the 3D cursor
application described in Sec. 4.3.4.

4.3  Real-Time  Hand  Tracking 

The DigitEyes system is the first real-time 3D hand tracking system based on
video  images  of  unmarked,  unadorned  hands.  Its  successful  performance  can 
be  attributed  to  two  factors:  the  use  of  kinematic  models  to  constrain  image 
interpretation  and  ameliorate  the  effects  of  noise,  and  the  use  of  a  high  image 
sampling  rate  to  minimize  the  size  of  the  search  space,  and  make  linearized  LS 
methods  feasible.  The  achievement  of  high  image  sampling  rates  is  one  of  the 
most  challenging  system-level  issues  in  constructing  a  real-time  vision-based 
tracking  system  [2].  Its  feasibility  depends  on  two  factors:  the  computational 
requirements  of  the  estimation  problem,  and  the  delay  involved  in  getting  the 
images  into  processor  memory.  These  issues  are  taken  up  in  the  next  two 
sections.  They  are  followed  by  experimental  real-time  tracking  results  for  a 
full  hand  model  using  two  cameras,  and  a  3D  cursor  user-interface  using  a 
single  camera. 

4.3.1  The  DigitEyes  System 

The  DigitEyes  real-time  tracking  system  is  based  on  the  feature  alignment 
approach  of  Sec.  2.5.  Point  and  line  features  have  two  computational  advan¬ 
tages  that  make  real-time  tracking  possible  on  a  conventional  microprocessor: 

•  They can be detected by searching along lines in the image, removing
the quadratic cost of area-based image processing. This eliminates the
need for special hardware to do correlations, for example.

•  Each feature contributes one number to the residual vector, and a
column to the Jacobian. This leads to small matrices which can be
processed quickly.


This section describes the hardware implementation of the DigitEyes system
and  discusses  its  computational  requirements. 


Figure 4.2: A single link tracker is shown along with its detected boundary
points. One slice through the image of a finger is also depicted. Peaks
in the derivative give the edge locations.


Fast  Feature  Detection 

A fast feature detection algorithm was developed for images without
significant amounts of self-occlusion. It is based on searching images along
slices, lines that are perpendicular to the projected model cylinder axis. As a
result of the high sampling rate, the actual finger phalange position in the
image will be close to the model projection, and will be intersected by several
slices. For each slice, the derivative of the 1D image profile is computed.
Peaks in the derivative with the correct sign correspond to the intersection of
the slice with the finger silhouette. The extracted intensity profile and peak
locations for a single slice are illustrated in Fig. 4.2. Line fitting to each set
of two or more detected intersections produces the feature for the link. The
residual follows as the perpendicular distance from the detected feature line
to a point on the base of the projected axis. If only one silhouette line is
detected for a given link, the cylinder radius can be used to extrapolate the
axis line location. Currently, the length of the slices (search window) is fixed
by hand. Finger tip positions are measured through a similar procedure.
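
The per-slice search can be sketched in C as below. This is an illustration
under assumptions rather than the DigitEyes code: it assumes a bright finger
on a darker background, so that the two silhouette crossings appear as the
strongest positive and strongest negative steps in the profile, and the
function name and the min_step noise threshold are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Find the two silhouette crossings along one slice.
   profile: intensity samples along the slice, length n.
   Assumption: bright finger on a dark background, so the first edge is the
   strongest positive step and the second the strongest negative step.
   min_step is a hypothetical noise threshold on the derivative magnitude.
   Returns 1 and writes the edge indices when both edges are found, else 0. */
int find_slice_edges(const unsigned char *profile, size_t n, int min_step,
                     size_t *rising, size_t *falling)
{
    int best_up = 0, best_down = 0;
    size_t i;
    assert(profile != NULL && rising != NULL && falling != NULL);
    *rising = 0;
    *falling = 0;
    for (i = 1; i < n; ++i) {
        /* A forward difference approximates the derivative of the 1D profile. */
        int d = (int)profile[i] - (int)profile[i - 1];
        if (d > best_up)   { best_up = d;   *rising = i; }
        if (d < best_down) { best_down = d; *falling = i; }
    }
    return best_up >= min_step && -best_down >= min_step && *rising < *falling;
}
```

Running this on each slice of a link yields two sets of edge points, one per
side of the finger, to which the silhouette lines can then be fitted.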

Computational  Requirements 

The costs of computing the forward kinematics, residual Jacobian, and state
estimate determine the processing requirements for hand tracking. The forward
kinematics computation is a sequence of matrix multiplications whose
cost  is  determined  by  the  kinematic  topology.  The  computational  costs  of 
the  residual  Jacobian  and  state  estimate  are  a  function  of  the  size  of  the 
feature  and  state  spaces.  They  consist  of  the  Jacobian  matrix  computation, 
using  the  technique  of  Sec.  2.4.4,  and  a  linear  system  solution. 


Component             Time (ms/iter.)   Details
Overlay Display             10.0
Forward Kinematics          18.0
Feature Detection           46.2        2.46 ms/link feature, 1.85 ms/tip feature
State Estimation            72.0        Jacobian: 35 ms, Linear Solve: 37 ms
Total Time                 146.0

Table  4.1:  Computational  cost  (measured  in  milliseconds)  associated  with  the 
main  components  of  hand  tracking  for  a  full  hand  model.  Overlay  display 
refers  to  drawing  model  backprojections  as  overlays  on  live  video. 


Table 4.1 shows the average computation time for the components of the
hand tracking system described above. These costs were measured for a full
hand model of 36 link frames, 28 states, and 35 residuals, running on a 68040
CPU (the IC40 board described below). The measurements were obtained
by timing with a stopwatch. The total computation time requirements for
a single iteration of the estimator, 146 ms, lead to a sampling rate of about 6.8
Hz for the full hand. For contrast, the required computation time was also
measured for a 6 DOF hand model, in which the palm pose was estimated
using measurements from three fingers. In this case, the total cost was 67
ms/iter., for a sampling rate of nearly 15 Hz.
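
The relation between per-iteration cost and sampling rate is simple
arithmetic, sketched below in C. The function names are hypothetical, and the
sketch assumes one estimator iteration per processed image, as in the
measurements above.

```c
#include <assert.h>
#include <stddef.h>

/* Sum per-component costs, each given in ms per iteration. */
double cycle_time_ms(const double *component_ms, size_t n)
{
    double total = 0.0;
    size_t i;
    for (i = 0; i < n; ++i)
        total += component_ms[i];
    return total;
}

/* Iterations per second, assuming one iteration per image sample. */
double sampling_rate_hz(double cycle_ms)
{
    assert(cycle_ms > 0.0);
    return 1000.0 / cycle_ms;
}
```

For example, sampling_rate_hz(67.0) gives a little under 15 Hz for the 6 DOF
model, matching the figure quoted above.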

Hardware  Architecture 

Most  modern  workstations  have  the  computational  power  required  for  real¬ 
time  hand  tracking,  as  Table  4.1  illustrates.  What  workstations  lack,  how¬ 
ever,  is  the  ability  to  transfer  images  into  working  memory  at  high  speeds, 
due  to  the  limitations  of  system  bus  bandwidths.  While  this  will  eventually 
improve,  some  specialized  hardware  is  currently  required  to  reduce  image 
transfer  time. 

The DigitEyes system is built around a special board for real-time image
processing, called the IC40, manufactured by Eltec, Inc. Each IC40 board
contains a 68040 CPU, 5 MB of dual-ported RAM, a digitizer, and a video
generator. The key feature of this system is the on-board digitizer, which can
write directly to CPU memory, thereby removing the bus bottleneck present
in most workstation-based systems.^ The IC40 can deliver digitized images
to the processor memory at video rate with no computational overhead.
Another important attribute of the IC40 is its video generator, which is used
to overlay backprojections of the estimated hand configuration on the input
video signal. The overlay makes possible on-line visual assessment of the
quality of the model registration. Ordinary C code is cross-compiled using
gcc on a Sun, and downloaded to the board for execution. The IC40 does
not run any operating system in the DigitEyes implementation.

^I am grateful to Omead Amidi and Yuji Mesaki for their help in obtaining the
IC40 and making it operational.

Figure 4.3: The hardware architecture for the stereo version of the DigitEyes
hand tracking system.

In  the  single  camera  version  of  the  system,  all  image  processing  and  state 
estimation  is  done  on  the  IC40  board,  and  state  estimates  are  communicated 
to  a  Sun  workstation  over  the  VME  bus.  The  Sun  passes  the  estimated 
states  to  a  Silicon  Graphics  Indigo  2  workstation  through  a  TCP/IP  connec¬ 
tion.  The  Indigo  2  asynchronously  renders  and  displays  the  model  using  the 
estimated  state.  The  overall  system  organization  is  shown  in  Fig.  4.3. 

In the stereo implementation, there is an IC40 board for each camera.
The total computation is divided into two parts: feature extraction and state
estimation. Feature extraction is done in parallel by each board, then the
extracted features are passed over the VME bus to the Sun workstation. Both
IC40 boards are memory mapped on the Sun, and a simple semaphore is used
to synchronize feature acquisition between them. A solver module running
on the Sun combines the two feature vectors, as described in Sec. 2.5.5,
and solves the resulting linear system to obtain the state estimate. Each
board has its own camera model, and uses it to compute its own forward
kinematics. The estimated state is passed back to each board at the end
of the estimation cycle, and is used to reposition the feature trackers. The
experimental testbed for hand tracking is depicted in Fig. 4.4.

Figure 4.4: Experimental test bed for the DigitEyes system.
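
One plausible way to combine the per-camera results, sketched below, is to
stack the residual vectors and the matching Jacobian row blocks so that both
cameras constrain the same state vector. This is an assumption made for
illustration: the actual combination rule is the one given in Sec. 2.5.5, and
the function names and row-major layout here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stack two row-major Jacobian blocks that share the same n state columns.
   J0 is m0 x n, J1 is m1 x n; out must hold (m0 + m1) * n doubles.
   Assumption: one row per residual, one column per state. */
void stack_jacobians(const double *J0, size_t m0, const double *J1, size_t m1,
                     size_t n, double *out)
{
    assert(J0 != NULL && J1 != NULL && out != NULL);
    memcpy(out, J0, m0 * n * sizeof *out);
    memcpy(out + m0 * n, J1, m1 * n * sizeof *out);
}

/* Concatenate the matching residual vectors in the same camera order. */
void stack_residuals(const double *r0, size_t m0, const double *r1, size_t m1,
                     double *out)
{
    assert(r0 != NULL && r1 != NULL && out != NULL);
    memcpy(out, r0, m0 * sizeof *out);
    memcpy(out + m0, r1, m1 * sizeof *out);
}
```

Because both cameras are calibrated to a common coordinate frame, the stacked
system has the same unknowns as the single-camera one, only more equations.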

4.3.2  Algorithm  Summary 

The  feature  alignment-based  tracking  algorithm  described  in  Sec.  2.5  is  the 
basis  for  the  DigitEyes  real-time  tracking  system.  This  section  summarizes 
the main steps in the algorithm and its fixed parameters, along with error
sources  that  impact  tracking  performance. 

Table 4.2 summarizes the fixed parameters in the feature alignment tracking
algorithm. These model and camera parameters are determined through
the calibration process of Secs. 2.2.3 and 2.3. The initial state is set for each
application and the user is required to place their hand in the known
configuration prior to tracking. The sampling rate is a function of the
complexity of the model, and will be described in more detail below. The
weights for the Gauss-Newton algorithm (see Sec. 2.5.3) were set empirically
and used in all of the experiments in Sec. 4.3.


Parameters             Description
Camera Model           11 extrinsic (pose) and intrinsic (image scale
                       and origin) parameters
Kinematic Model        Joint axes, link lengths, and anchor points
Initial State          Starting point for tracking, q_0
Sampling Rate          Frequency at which images are processed
Feature Window Size    Size of slice in search for finger edges, set at 20 pixels
Gauss-Newton Weights   Stabilize quaternion (1.0), translation (10.0), and
                       joint angle (1000.0) state estimates

Table 4.2: Table of fixed parameters for feature alignment tracking algorithm.

Tracking begins with the user’s hand in the initial configuration. This is
aided  by  overlaying  the  projected  hand  model  with  the  video  image  during  the 
positioning  stage.  Once  the  system  is  initialized,  tracking  proceeds  through 
the  following  steps: 

1.  Update link frame positions with respect to the camera, using the current
state estimate.

2.  Project link frames into image through camera model and initialize
search windows.

3.  Process image slices and find edge points.

4.  Compute line and tip feature measurements from edge points.

5.  Compute residual Jacobian for each measured feature.

6.  Compute state correction through pseudo-inverse (Eq. 2.22) by solving
the linear system.

7.  Update the state estimate.
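
Steps 5 to 7 above can be illustrated with a small stabilized least-squares
solve in C. This is a sketch under assumptions, not a transcription of
Eq. 2.22: it assumes the correction dq solves (J^T J + W) dq = J^T r, where W
is a diagonal matrix built from the Gauss-Newton weights of Table 4.2, J has
one row per residual and one column per state, and r is the residual vector.
The function name and the MAX_STATES bound are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STATES 32   /* illustrative bound; the full hand model has 28 states */

/* One stabilized least-squares step: solve (J^T J + W) dq = J^T r for dq.
   J is m x n in row-major order, r holds m residuals, and w holds n
   non-negative diagonal weights (assumed form of the stabilizer).
   Returns 0 on success, -1 if the system is numerically singular. */
int gauss_newton_step(const double *J, const double *r, const double *w,
                      size_t m, size_t n, double *dq)
{
    double A[MAX_STATES][MAX_STATES + 1];    /* augmented normal equations */
    size_t i, j, k;
    assert(n > 0 && n <= MAX_STATES);

    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            double s = (i == j) ? w[i] : 0.0;
            for (k = 0; k < m; ++k)
                s += J[k * n + i] * J[k * n + j];
            A[i][j] = s;
        }
        A[i][n] = 0.0;
        for (k = 0; k < m; ++k)
            A[i][n] += J[k * n + i] * r[k];
    }

    /* Gaussian elimination with partial pivoting. */
    for (i = 0; i < n; ++i) {
        size_t p = i;
        double best = A[i][i] < 0.0 ? -A[i][i] : A[i][i];
        for (k = i + 1; k < n; ++k) {
            double v = A[k][i] < 0.0 ? -A[k][i] : A[k][i];
            if (v > best) { best = v; p = k; }
        }
        if (best < 1e-12)
            return -1;
        if (p != i)
            for (j = i; j <= n; ++j) {
                double t = A[i][j]; A[i][j] = A[p][j]; A[p][j] = t;
            }
        for (k = i + 1; k < n; ++k) {
            double f = A[k][i] / A[i][i];
            for (j = i; j <= n; ++j)
                A[k][j] -= f * A[i][j];
        }
    }

    /* Back substitution. */
    for (i = n; i-- > 0; ) {
        double s = A[i][n];
        for (j = i + 1; j < n; ++j)
            s -= A[i][j] * dq[j];
        dq[i] = s / A[i][i];
    }
    return 0;
}
```

In this form a larger weight shrinks the correction applied to its state,
which is one way the joint angle estimates could be kept from reacting too
strongly to noisy residuals.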

The  algorithm  outlined  above  was  used  in  all  of  the  real-time  exper¬ 
iments  in  this  thesis.  Accurate  tracking  requires  accurate  kinematic  and 
camera  models  and  a  sufficient  number  of  iterations  of  the  estimation  al¬ 
gorithm  between  frames.  Model  accuracy  ensures  that  the  residual  minima 
will  correspond  to  the  minimum  error  state,  while  adequate  iterations  ensure 
that  the  minima  will  be  reached  for  each  frame.  Track  life  for  the  feature 
alignment  algorithm  is  determined  by  the  alignment  between  the  image  and 
model  projections  in  each  frame.  Track  loss  occurs  when  the  search  window 
constructed  around  a  projected  link  of  the  model  fails  to  contain  the  correct 
feature.  Tracking  accuracy  impacts  track  life  through  the  size  of  the  resid¬ 
ual.  The  residual  grows  with  the  distance  between  the  projected  model  and 
the  detected  features.  If  it  becomes  too  large,  features  may  lie  outside  their 
associated  search  windows.  Excessive  hand  velocity  can  also  lead  to  track 
loss,  as  the  feature  displacement  in  the  image  between  frames  may  exceed 
the  search  window  size.  This  maximum  displacement  is  determined  by  the 
hand  velocity  in  conjunction  with  the  sampling  rate. 
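
The link between hand velocity, sampling rate, and search window size can be
made concrete with a short C sketch. The names are hypothetical, the feature
speed is expressed directly in image pixels per second, and the window is
assumed to be centred on the projected model, which the text does not state.

```c
#include <assert.h>

/* Largest image displacement (pixels) a feature can make between frames.
   speed_px_per_s: feature speed in the image; rate_hz: sampling rate. */
double displacement_per_frame(double speed_px_per_s, double rate_hz)
{
    assert(rate_hz > 0.0);
    return speed_px_per_s / rate_hz;
}

/* A feature can only be found if it stays inside the slice centred on the
   projected model, i.e. within half the window length of the prediction.
   (Assumption: the window is centred; the source does not say.) */
int feature_stays_in_window(double speed_px_per_s, double rate_hz,
                            double window_px)
{
    return displacement_per_frame(speed_px_per_s, rate_hz) <= 0.5 * window_px;
}
```

With the 20 pixel window of Table 4.2, a feature moving at 150 pixels/sec
leaves the window at 10 Hz but stays inside it at 15 Hz, which shows why a
higher sampling rate extends track life.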

4.3.3  Whole  Hand  Tracking 

The most ambitious tracking experiment attempted with the DigitEyes
system was full 27 DOF hand tracking using two cameras. Two Sony XC-75
cameras were positioned 1.5 feet apart with optical centers verging near the
middle of the tracking area, and intersecting the table surface at
approximately 45 degrees. They were both calibrated to the same coordinate
frame, located in the tabletop. The distance from the cameras to the tabletop
was approximately five feet. The tracker incorporated the full hand model from
Appendix A. Line and point features from 13 of the 15 finger phalanges
were employed in tracking. No features were extracted from the proximal
phalanges of the middle two fingers, due to the impossibility of avoiding
occlusions of these features during motion. No features were extracted from
the palm, due to a desire to keep the feature extraction code simple and
uniform.

Figure 4.5: Three pairs of hand images (Camera 0 view, Camera 1 view) from
the continuous motion estimate plotted in Figs. 4.7 and 4.8. Each stereo pair
was obtained automatically during tracking by storing every fiftieth image set
to disk. The samples correspond to frames 49, 99, and 149.

Figure 4.6: Estimated hand state for the image samples in Fig. 4.5, rendered
from the Camera 0 viewpoint (left) and a viewpoint underneath the hand
(right).

Tracking began with the hand in a pre-arranged position on the tabletop.
Because the hand motion had to avoid occlusions for successful tracking,
the available range of travel was not large. It was sufficient, however, to
demonstrate recovery of articulated DOFs in conjunction with palm motion.
Figure 4.5 shows sample images, trackers, and features from both cameras at
three points along a 200 frame sequence. The sample images were obtained
automatically during tracking by writing every 50th image to disk.^ Figure
4.6 shows the estimated model configurations corresponding to these sample
points. In the left column, the estimated model is rendered from the
calibrated viewpoint of the first camera. In the right column, it is shown from
an arbitrary viewpoint, demonstrating the 3D nature of the tracking result.
State estimates were logged by a program running on the Sun. The graphical
model figures were rendered off-line, using the logged states.

Close examination of the sample images and backprojected models shows
some of the residual error properties of the tracker. The first thing to note is
that the fit is quite good overall, indicating the basic adequacy of both the
measurements and the kinematic model. The most obvious indications of
small errors are misalignments between anchor points and knuckle positions,
and between projected and actual joint centers and finger tips. A more
interesting error is visible in the images from frame 99. From the shading cues
in the images, it is clear that the PIP joint^ on the fourth (little) finger is
strongly bent. Yet examination of the estimated model pose, particularly in the
synthesized view from under the palm, shows that the estimated PIP joint angle
is zero, and the estimator placed all of the bending at the MCP joint.

This error is the result of the fourth finger being in a singular
configuration, in which none of the line features give information about its
pose. In this case, only the tip position contains information about the degree
of bending, and the system is free to assign angles among all three joints to
achieve it.

^Samples obtained at 50 frame intervals were found to capture the most
significant hand poses during tracking.

^See Fig. 2.4 for the joint labels.

Figure 4.7: Estimated palm rotation and translation for motion sequence of
entire hand (panels: Palm Rotation, Palm Translation; horizontal axis: frames,
100 ms/frame). The rotation is plotted as its quaternion components and the
translation as its three components. The sequence lasted 20 seconds.

[Figure 4.8: panels Finger 1 States and Thumb States; horizontal axis: frames,
100 ms/frame.]

Figure 4.9: A sample graphical environment for a 3D mouse. The 3D cursor
is at the tip of the “mouse pole”, which sits atop the ground plane (in the
foreground, at the right). The sphere is an example of an object to be
manipulated, and the line drawn from the mouse to the sphere indicates its
selection for manipulation.

4.3.4  3D  Mouse  User-Interface 

Hand motion estimated in real-time by the DigitEyes system using a
simplified hand model was employed to drive a 3D mouse interface [46, 47].
Figure 4.9 shows an example of a simple 3D graphical environment,
consisting of a ground plane, a 3D cursor (drawn as a pole, with the cursor at
the top), and a spherical object (for manipulation). Shadows generate
additional depth cues. The interface problem is to provide the user with control
of the cursor’s three DOFs, and thereby the means to manipulate objects in the
environment.

In the standard “mouse pole” solution [71], the 3D cursor position is
controlled by clever use of a standard 2D physical mouse. Normal mouse
motion controls the base position of the pole on the ground plane. Depressing
one of the mouse buttons switches reference planes, causing mouse motion
in one direction to control the pole (cursor) height. By switching between
planes, the user can place the cursor arbitrarily. Commanding continuous
motion with this interface is awkward, however, and tracing an arbitrary,
smooth space curve is nearly impossible. DigitEyes was used to develop a
3D virtual mouse that permitted simultaneous hand-based control of the
cursor’s DOFs.

This  application  of  the  DigitEyes  system  served  two  purposes.  First,  it 
provided  a  qualitative  test  of  the  system’s  ability  to  recover  3D  information 
using  a  single  image  sequence.  Second,  it  demonstrated  the  capability  of  the 
tracking  framework  to  provide  adequate  sensing  for  a  practical  application. 
Experience with the interface suggests areas for future improvement of the
system. 

In the DigitEyes solution to the 3D mouse problem, the 3 input DOFs
are derived from a partial hand model, which consists of the first and fourth
fingers of the hand, along with the thumb. The palm is constrained to lie in
the plane of the table used in the interface, and thus has 3 DOF. The first
finger has 3 articulated DOFs, while the fourth finger and thumb each have
a single DOF allowing them to rotate in the plane of the table (abduct). The
hand model is illustrated in Fig. 4.10. A single camera oriented at
approximately 45 degrees to the table top acquires the images used in tracking.
The palm position in the plane controls the base position of the pole, while the
height of the index finger above the table controls the height of the cursor.
This particular mapping has the important advantage of decoupling the
controlled DOFs, while making it possible to operate them simultaneously. For
example, the user can change the pole height while leaving the base position
constant. The fourth finger and thumb have abduction DOFs in the plane,
and are used as “buttons”. The cost of estimating the reduced hand model
was measured at 96.4 ms/iter. by timing with a stopwatch (see Sec. 4.3.1).
This gives an estimation rate of 10 Hz.

Figure 4.10: The hand model used in the 3D mouse application is illustrated
for frame 200 in the motion sequence from Fig. 4.12. The vertical line shows
the height of the tip above the ground plane. The input hand image (frame
200) demonstrates the finger motion used in extending the cursor height.
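
The mapping from the reduced hand state to the mouse pole can be sketched in
C as follows. The struct layouts, the scale factor, the button threshold, and
the clamping of the height at the ground plane are all illustrative
assumptions; the text specifies only that the palm position drives the pole
base, the index fingertip height drives the cursor height, and abduction acts
as a button.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the reduced hand state used by the interface. */
typedef struct {
    double palm_x, palm_z;      /* palm position in the table plane */
    double fingertip_height;    /* index fingertip height above the table */
    double thumb_abduction;     /* radians, rotation in the table plane */
} HandState;

typedef struct {
    double base_x, base_z;      /* pole base on the ground plane */
    double height;              /* cursor height at the top of the pole */
    int button_down;
} MousePole;

/* scale and button_threshold are illustrative tuning values. */
MousePole hand_to_mouse_pole(const HandState *hand, double scale,
                             double button_threshold)
{
    MousePole pole;
    assert(hand != NULL);
    pole.base_x = scale * hand->palm_x;
    pole.base_z = scale * hand->palm_z;
    pole.height = scale * hand->fingertip_height;
    if (pole.height < 0.0)
        pole.height = 0.0;      /* assumed: cursor stays on or above the ground */
    pole.button_down = hand->thumb_abduction > button_threshold;
    return pole;
}
```

Because the base and the height are computed from disjoint parts of the hand
state, raising the finger changes only the height, which is the decoupling
property noted above.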


Figures 4.11 - 4.13 give experimental results from a 500 frame motion
sequence in which the estimated hand state was used to drive the 3D mouse
interface. Figures 4.11 and 4.12 show the estimated hand state for each frame
in the image sequence. Frames were acquired at 100 ms sampling intervals.
The pole height and base position derived from the hand state by the 3D
mouse interface are also depicted in Fig. 4.12. The motion sequence has four
phases. In the first phase (frame 0 to 150), the user’s finger is raised and
lowered twice, producing two peaks in the pole height, with a small variation
in the estimated pole position. Second, around frame 150 the finger is raised
again and kept elevated, while the thumb is actuated, as for a “button event”.
The actuation period is from frame 150 to frame 200, and results in some
change in the pole height, but negligible change in pole position. Third,
from 200 to 350, the pole height is held constant while the pole position is
varied. Fourth, from 350 to 500, all three DOFs are varied together.

Figure 4.12: Translation states for mouse pole hand model are given on the
left (panel: Palm Translation). The Y axis motion is constrained to zero due to
the tabletop. On the right (panel: Mouse Pole Interface) are the mouse pole
states, derived from the hand states through scaling and a coordinate change.
The sequence of events is: 0-150 finger raise/lower, 150-200 thumb actuation
only, 200-350 base translation only, 350-500 combined 3 DOF motion.

Three attributes of the tracker affect the interface:

•  Sampling  rate 

•  Sensitivity 

•  Latency 

The quality of the interface as a whole seemed to depend on another set of
three properties, which are closely linked to the tracker attributes above.

•  Maximum hand speed

•  Transient DOF coupling

•  Resolution

Figure 4.13: The mouse pole cursor at six positions during the motion
sequence of Fig. 4.11. The pole is the vertical line with a horizontal shadow,
and is the only thing moving in the sequence. Samples were taken at frames
0, 30, 75, 260, 300, and 370 (chosen to illustrate the range of motion).

To  illustrate  the  impact  of  the  tracker’s  performance  on  an  application, 
each  of  the  above  issues  is  examined  in  turn.  The  maximum  possible  speed 
of  the  user’s  hand  across  the  tabletop  is  a  function  of  the  sampling  rate  of 
the  estimation  algorithm,  in  relation  to  the  error  surface  properties  of  the 
residual. In the specific case of the virtual mouse interface, the tracker could
tolerate  hand  motions  of  about  2.5  in/sec  before  track  loss  began.  This  was 
measured  experimentally  by  timing  repeated  hand  translations  in  the  plane, 
keeping  the  tracker  on  the  edge  of  convergence  by  observing  the  real-time 
overlay  of  the  backprojected  model  and  images. 

Transient  coupling  between  DOFs  is  a  second  factor  that  is  affected  by 
the  sampling  rate.  State  coupling  is  a  natural  consequence  of  the  kinematic 
constraints  which  make  tracking  possible.  These  constraints  lead  to  transient 
effects  in  the  estimator,  however,  that  can  negatively  impact  performance. 
An  example  of  transient  coupling  occurs  around  frame  150  in  the  Button 
State  and  Mouse  Pole  Interface  plots  from  Figs.  4.11  and  4.12.  When  the 
thumb  is  actuated  for  a  button  event,  the  pole  height  drops  initially,  and 
then  rises  back  to  its  previous  level  over  the  course  of  about  20  frames. 
This  behavior  is  the  result  of  an  initial  tendency  of  the  estimator  to  spread 
residual  error  over  all  of  the  states  that  can  reduce  it.  Only  after  the  thumb 
has  had  time  to  rotate,  and  absorb  most  of  its  residual  error,  are  the  other 
residuals  able  to  reassert  their  control  over  their  own  DOFs.  The  duration 
of  these  transient  effects  is  primarily  a  result  of  the  sampling  rate.  More 
iterations/sec.  make  the  estimator  “stiffer,”  and  reduce  the  effect  of  these 
disturbances. Interestingly, very similar experimental observations have been
made in the domain of robot control [30].

The last property of the interface, the resolution with which the cursor
position can be controlled, is largely a function of the estimator sensitivity.
As described in Sec. 2.5.3, the sensitivity of a state varies with position in
the state space. The large scale effect of this is that the ease of use of the
interface depends strongly on the palm orientation relative to the camera.
Consider rotating the palm on the table as the pole height is varied. The
orientation at which the plane of the finger contains the camera is a singular
configuration, and pole height becomes extremely difficult to measure. The
sensitivity of the estimator to the finger motion decreases as this singularity
is approached. The effective resolution in the cursor position is determined
by the state sensitivity. The more sensitive the state, the larger the range
of image displacements that are produced by a given range of state space
motion. This in turn leads to a finer resolution in state space, and greater
ability to control the cursor at a fine level of detail.

The effects of latency were not studied in detail for the virtual 3D mouse
problem, as they were not extremely significant. Latency refers to the time
delay  between  hand  motion  and  the  response  of  the  interface.  Long  latency 
times  make  control  of  the  interface  impossible.  As  a  result  of  the  virtual  3D 
mouse  interface  design,  the  total  latency  was  determined  by  the  estimator 
cycle  time,  the  communication  delay  to  the  Indigo  2,  and  the  model  rendering 
time.  These  last  two  additional  effects  added  around  30  ms  to  the  100  ms 
cycle  time.  The  effect  of  the  total  latency  was  noticeable,  but  did  not  make 
the  cursor  uncontrollable. 


4.4  Tracking  Self-Occluding  Hand  Motion 

The representations and algorithms for self-occluding motion described in
Chpt. 3 were implemented in an off-line version of the DigitEyes system. This
section gives a complete summary of the resulting tracking algorithm, and
presents experimental results for a two finger motion sequence with significant
self-occlusion.

4.4.1  Algorithm  Summary

The template-based tracking algorithm for self-occluding motion described in
Chpt. 3 is summarized along with a discussion of the primary error sources.
Table 4.3 lists the fixed parameters in the template-based tracking algorithm.
In addition to the camera and kinematic models, a template model must be
specified for each link. These templates are obtained manually from a set of
reference images before tracking begins.


Parameters        Description
----------------  -----------------------------------------------------------
Camera Model      11 extrinsic (pose) and intrinsic (image scale and origin)
                  parameters
Kinematic Model   Joint axes, link lengths, and anchor points
Template Model    Sufficient views for each link in object
Initial State     Starting point for tracking, q_0
Sampling Rate     Frequency at which images are processed
Step size         Scales state correction in gradient-descent algorithm

Table 4.3: Table of fixed parameters for feature alignment tracking algorithm.

Tracking begins with the user’s hand in the initial configuration. This is
aided by overlaying the projected hand model with the video image during the
positioning stage. Once the system is initialized, tracking proceeds through
the following steps:

1.  Update  link  frame  positions  with  respect  to  the  camera,  using  the  current 
state  estimate. 

2.  Project  link  templates  into  image  through  camera  model. 

3.  Segment  image  pixels,  assigning  them  to  templates. 

4. Compute residual and Jacobian for each segmented pixel.

5. Compute state correction through gradient-descent minimization (Eq. 2.12).

6. Update the state estimate.
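
The steps above can be sketched on a toy problem. The following is a minimal illustration and not the DigitEyes implementation. It assumes a planar two-joint finger with hypothetical link lengths, a trivial orthographic "camera" that observes the fingertip position directly, and a plain gradient step q <- q - step * J^T r on the squared residual. The exact update of Eq. 2.12 is not reproduced here, and step 3 (segmentation) is omitted because there is only one feature. All function names are invented for this example.

```python
import math

LINKS = (45.0, 26.0)  # hypothetical link lengths (mm), for illustration only


def project_tip(q):
    """Steps 1-2: forward kinematics plus a trivial orthographic projection."""
    a1, a2 = LINKS
    x = a1 * math.cos(q[0]) + a2 * math.cos(q[0] + q[1])
    y = a1 * math.sin(q[0]) + a2 * math.sin(q[0] + q[1])
    return (x, y)


def residual_and_jacobian(q, measured):
    """Step 4: residual r = predicted - measured, and its Jacobian dr/dq."""
    a1, a2 = LINKS
    x, y = project_tip(q)
    r = (x - measured[0], y - measured[1])
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    J = ((-a1 * s1 - a2 * s12, -a2 * s12),
         (a1 * c1 + a2 * c12, a2 * c12))
    return r, J


def track_frame(q, measured, step=1e-4, iterations=2000):
    """Steps 5-6, repeated: gradient descent on 0.5 * |r|^2 for one frame."""
    q = list(q)
    for _ in range(iterations):
        r, J = residual_and_jacobian(q, measured)
        # The gradient of 0.5 * |r|^2 with respect to q is J^T r.
        grad = [J[0][k] * r[0] + J[1][k] * r[1] for k in range(2)]
        q = [q[k] - step * grad[k] for k in range(2)]
    return q


def residual_norm(q, measured):
    """Euclidean length of the image-plane residual at state q."""
    x, y = project_tip(q)
    return math.hypot(x - measured[0], y - measured[1])
```

In use, the state estimate from the previous frame serves as the starting point, for example `track_frame((0.5, 0.4), project_tip((0.6, 0.5)))`. The step size here was chosen by hand for this toy geometry; too large a value makes the iteration diverge, which is why the step size appears as a fixed parameter in Table 4.3.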

The accuracy of the tracking algorithm depends, as in Sec. 4.3.2, on the
accuracy of the kinematic and camera models. In addition, errors in the
template models, such as unexpected shading variations, can cause the minimum
residual state to differ from the correct state, degrading the tracking
accuracy. A sufficient number of iterations of the estimation algorithm between
frames is also required for accurate tracking. Track life in the template-based
algorithm is determined by the shape of the state space error surface, which
is minimized during estimation. For each image, there is a region of
convergence (ROC) centered around the minimum residual state. Track loss occurs
if the starting point for minimization lies outside this ROC in any frame.
This could happen as a result of errors in the camera, kinematic, or
template models. It could also occur if the hand velocity between frames is too
large, resulting in a state displacement outside of the ROC. The maximum
state displacement is determined by the hand velocity in conjunction with
the sampling rate.
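
The last relationship can be written down directly. The sketch below is illustrative only. It assumes a scalar bound on the state velocity and approximates the ROC as a ball of radius `roc_radius` around the minimum residual state. The function names and that approximation are introduced here and are not taken from the thesis.

```python
def max_state_displacement(max_state_velocity, sampling_rate_hz):
    """Worst-case state change between frames: velocity times sampling interval."""
    if sampling_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return max_state_velocity / sampling_rate_hz


def stays_in_roc(max_state_velocity, sampling_rate_hz, roc_radius):
    """True if the worst-case inter-frame displacement stays inside the ROC.

    The ROC is modeled as a ball of the given radius (a simplifying assumption).
    """
    return max_state_displacement(max_state_velocity, sampling_rate_hz) < roc_radius
```

For a fixed velocity bound, raising the sampling rate shrinks the inter-frame displacement, so a tracker that loses track at a low frame rate may keep it at a higher one.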

In addition to the basic requirements for template-based tracking
described above, there are four necessary conditions for tracking self-occluding
objects:

1. There are no points in the state space where the occlusion properties
change instantaneously. This ensures that all regions of occlusion are
separated by disjoint regions.

2. The sampling rate is high enough to prevent occlusion ambiguities. The
product of the sampling interval (the reciprocal of the sampling rate) and
the maximum state velocity must be less than the minimum distance through a
disjoint region of the state space. When this condition is met, the disjoint
regions can be grown by the motion interval without bringing them into
contact.

3. The composite Jacobian formed from all camera viewpoints must be
full rank. This ensures that measurements are available for each state
in the linearized system. If this condition is not met, there will be no
estimates for some states, and any occlusion properties that depend
on these states cannot be determined. This could cause the visibility
order prediction to fail.

4. The occlusion graph for the tracked object must be acyclic at all points
in the state space.
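
Condition 4 can be illustrated with a short sketch. It assumes, purely for illustration, that the occlusion graph at one state is given as a dictionary mapping each template name to the set of templates it occludes. This representation and the template names are hypothetical and are not the data structures of Chpt. 3. A consistent visibility order exists exactly when this directed graph has no cycle, which a topological sort (Kahn's algorithm here) detects.

```python
from collections import deque


def visibility_order(occludes):
    """Return templates ordered front to back, or None if the graph has a cycle.

    `occludes[a]` is the set of templates that `a` occludes (edge a -> b).
    """
    nodes = set(occludes)
    for targets in occludes.values():
        nodes.update(targets)
    indegree = {n: 0 for n in nodes}
    for targets in occludes.values():
        for b in targets:
            indegree[b] += 1
    # Templates occluded by nothing are the front-most candidates.
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        a = ready.popleft()
        order.append(a)
        for b in sorted(occludes.get(a, ())):
            indegree[b] -= 1
            if indegree[b] == 0:
                ready.append(b)
    # If some node was never released, it lies on a cycle.
    return order if len(order) == len(nodes) else None
```

A `None` result corresponds to a violation of condition 4: no single front-to-back ordering of the templates is consistent with the occlusion relations at that state.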

If these conservative requirements are met, the prediction of the visibility
order will succeed for all possible motions of the object. In practice, there
may be points in the state space where one or more of these conditions are
violated. For example, occlusion ambiguities arise in hand tracking during a
gesture like “stop”, as discussed in Sec. 3.4.1. In such cases it may be
necessary to use other information, such as the velocity of the object, to
disambiguate the visibility order.

4.4.2  Two  Finger  Tracking  Results 

The main representations and algorithms for self-occluding motion described
in Chpt. 3 have been implemented in an off-line version of the DigitEyes
system. This section presents the first experimental results using this tracker,
for a two-finger motion sequence with significant self-occlusion, depicted in
Fig. 4.14.

In the sequence, my index finger curls into my palm while my hand and
remaining fingers are held still. An 80 frame sequence was digitized from
videotape and sampled for an effective frame rate of approximately 15 Hz.
This resulted in an average finger tip displacement between frames of about
three pixels. The camera was positioned at approximately 45 degrees to the
table top, facing the palm. As a result of this camera position, the first finger

Figure 4.14: Sample input images and associated state estimates for frames 0,
13, 30, and 75 in the motion sequence. The two-finger hand model is rendered
with respect to the calibrated camera model using the estimated state. The
overlays show the template boundaries and projection of cylinder center axes.
These frames were selected for their representative self-occlusions.

Figure 4.16: Estimated translation state of two finger model.

error, photometric template model error, and template shape error. Of these,
the last two were noticeable in the residual, and warrant further study.

These results demonstrate the potential of the direct template registration
approach to tracking self-occluding objects. From a classical feature
detection perspective, the images in the sequence are quite difficult. All of
the phalanges of the middle finger are partially occluded during some portion
of the motion sequence, and the index finger is silhouetted against the
fingers and palm for most of its motion. A significant advantage of the
window-based approach is that it can tolerate any amount of occlusion and
continue to extract useful information from the pixels that are visible. The
successful tracking of this complicated motion testifies again to the power of
the kinematic model in constraining the interpretation of the image.

All of the experiments in this chapter employed a black cloth backdrop
to ensure high contrast between the hand and its background. Invariance to
background was not addressed, as it is believed to be less important than
the kinematic and self-occlusion issues which were the focus of this thesis. In
practice, applications can be designed with a constrained background, as the
3D virtual mouse interface demonstrates. However, a background template
can be added to the framework tested in this section, making it possible to
exploit a fixed background image in tracking.

4.5  Summary 

This chapter describes experimental results in hand tracking, both in real
time and using off-line image sequences. Two algorithms were tested: the line
and point feature-based algorithm of Sec. 2.5, and the layered template
algorithm for self-occluding motion described in Chpt. 3. The presented results
include the first experimental demonstration of 27 DOF visual tracking, and
the first tracking results for articulated motion with significant amounts of
occlusion. All experiments were conducted with natural images of unmarked
hands.


Chapter  5 

Previous  Work 


Research in human motion analysis spans a wide variety of disciplines from
biomechanics to human-computer interaction and virtual reality, from
computer vision to computer graphics. Previous work in this area can be
classified along three overlapping lines. The first body of work, which includes
this thesis, is concerned with 3D analysis of human motion. It is distinct
from a second body of work in 2D gesture recognition. This work is
concerned solely with the mapping from image sequences to a set of discrete
classes. In principle, the 3D tracking approach can be applied to this
problem as well [15], but work in this area often takes a learning approach and
tries to avoid 3D model specification. A third body of work develops special
purpose algorithms for human sensing applications. It is not concerned with
developing general frameworks, as is the case in the previous two areas.


5.1  3D  Motion  Analysis 

In 3D tracking approaches, a model of the articulated object is employed to
constrain image interpretation [50, 48, 47, 29, 32, 39, 33, 15, 72, 44, 23, 42].
A second class of 3D analysis problems attempts to recover both 3D structure
and motion (or pose) simultaneously [24, 57, 68, 45]. This latter class is a
significant departure from the approach of this thesis, and will not be
considered in detail. Since none of the articulated tracking work prior to this
thesis dealt explicitly with self-occlusions, the comparison in these sections
will be concerned only with the use of kinematics to provide geometric
constraints in tracking. Connections between Chpt. 3 and other work on
occlusion are described later.

The two earliest systems for visual human motion analysis, by O’Rourke
and Badler [42] and David Hogg [23], approached model-based recovery of
human motion using the respective AI search techniques of constraint
propagation and heuristic search of a discretized state space. Both works stand
out in the complexity of the model constraints they applied to the tracking
problem. O’Rourke’s system was capable of incorporating occlusion and rigid
body noninterpenetration constraints in pose determination. Hogg’s system
also included postural models.

The treatment of occlusion constraints in O’Rourke’s system provides
an interesting complement to their role in this thesis. In the example in
his paper, the onset of occlusion is detected at the image level, and then
the model is positioned so as to achieve the occlusion constraint. In that
particular case, the prediction of occlusion from the model would be very
difficult, as the arm is in a singular configuration. One could imagine using
the occlusion prediction mechanism in Chpt. 3 to rule out unlikely occlusion
events, reducing the number of possibilities that had to be searched.

Hogg’s work is probably the most relevant to this thesis, as it dealt with
motion explicitly and presented results for an image sequence of a walking
figure that are still impressive by today’s standards. From a conceptual
viewpoint, the two biggest distinctions between this work and Hogg’s lie in
the kinematic representation and search method. Robotic kinematic models
provide powerful tools for converting articulated tracking into a continuous
estimation problem. These techniques make it possible to handle a much
larger state space and integrate kinematic constraints and image
interpretation directly.

The idea of applying robotic kinematic models to human motion tracking
was first proposed by Yamamoto and Koshikawa [72]. They presented 2D
tracking results for a three DOF system of a human arm and torso. This
work extends the tracking framework in their paper significantly in several
directions: explicit presentation of robot kinematic models based on DH
notation, analysis of singular configurations, 3D real-time tracking results,
application to a user-interface domain, and high DOF tracking. Their more
recent publications [29, 32] present off-line 3D tracking results using two
cameras, following the approach to integrating multiple views described in
Sec. 2.5.5. These results are very interesting, as they provide evidence that
the techniques in this thesis are applicable to body tracking as well.

Works in the area of physics-based modeling have also addressed
articulated body motion [65, 39, 44]. One of the applications of deformable models
presented in [65] is 3D tracking of a single finger from a stereo image
sequence. Pentland and Horowitz [44] give an example of tracking the motion
of a human figure using optical flow and an articulated deformable model. In
a related approach, Metaxas and Terzopoulos [39] track articulated motion
using deformable superquadric models.

Although most researchers working in the gesture recognition area have
pursued 2D approaches, there are a few works that investigate 3D
analysis [15, 33]. Dorner describes a system for interpreting American Sign
Language from image sequences of a single hand in [15]. In her system, the user
wears a glove with different colors to aid in finger segmentation. Another 3D
approach based on wearing gloves with fiducial points is described in [33].

5.2  2D  Gesture  Analysis 

There has been a large amount of work on applying static and dynamic
gesture recognition approaches to hand imagery. Three representative works
based on learning approaches are [19, 12, 56]. In [12], Darrell and Pentland
describe a system for learning and recognizing dynamic hand gestures. Their
approach tries to avoid explicit models by building a library of template
models on-line. Work by Segen [56] takes a neural network approach to 2D
hand gesture recognition. Some sample interfaces based on gestural control
of computer graphics models are described. Freeman describes a gesture
recognition system based on orientation histograms in [19]. All of these
systems obtain real-time performance.

Although many frameworks for human motion analysis are possible, an
approach based on full-state 3D tracking has four main advantages. First, by
tracking all of the hand’s DOFs, the end-user is provided with the maximum
possible flexibility for interface applications. (See [61, 27] for examples of
interfaces requiring a whole-hand sensor.) In addition, a general modeling
approach based on 3D kinematics makes it possible to track any subset of
hand or body states with the same basic algorithm. Another benefit of full
state tracking is invariance to unused hand motions. The motion of a
particular finger, for example, can be recognized from its joint angles regardless
of the pose of the palm relative to the camera. Finally, modeling the hand
kinematics in 3D eliminates the need for application- or viewpoint-dependent
user modeling.

5.3  Application-Specific  Human  Sensing 

Many authors have used hand and body images to test 2D tracking and
registration algorithms. Many of these approaches are applicable to user
interface or surveillance domains. A glove-based approach that uses motion
parallax to control a graphical environment is described in [9]. In the domain
of human figures, two approaches to 2D tracking are [5, 25]. Huttenlocher et al.
have applied the Hausdorff distance measure to register images of moving
people [25]. In a related effort, Baumberg and Hogg [5] describe a real-time
pedestrian tracking system based on active shape models. Human motion
analysis based on more invasive techniques, such as mechanical
sensors [74, 7] or active targets [37], has a long history.

5.4  Layered  Representations 

The layered representation for self-occlusion presented in Chpt. 3 is related to
other work in tracking and motion coding. Layered representations based on
clustering optical flow are presented in [1, 11, 67]. This work is largely
concerned with automatically generating layered, velocity-based representations
of a motion sequence that could serve as a model for coding or recognition.
A coding approach based on global image models is presented in [26]. A
layered representation based on the occluding contours of a single image is
described in [41]. These works are complementary to the approach in this
thesis, which is concerned with making the best use of available models. In
addition, the kinematic representation of self-occlusions is a generalization
of layered representations based on depth ordering in the scene, since it is
designed to exploit orderings within configuration space.

As  a  result  of  modeling  self-occlusion  in  the  image  plane,  tracking  can 
be  formulated  as  a  direct  optimization  problem  over  an  image-based  residual 
error.  The  approach  of  coupling  the  image  interpretation  (feature  detection) 
problem  directly  to  the  model  was  popularized  by  deformable  models  [65] 
(including  2D  Snakes  [28])  and  has  since  been  applied  to  a  variety  of  other 
domains [51, 73].


Chapter 6

Conclusion  and  Future  Work 


A vision-based sensor can provide a passive, noninvasive solution to human
motion tracking problems, since it can be located in the user’s environment
rather than on their person. To achieve these goals, computer vision
algorithms have been developed that can estimate 3D articulated motion from
ordinary intensity images of unmarked hands or bodies at video rates.

This  dissertation  has  presented  new  results  in  applying  kinematic  models 
to  articulated  object  tracking.  By  adopting  the  representations  and  tools  of 
robotics,  powerful  computational  and  analytic  tools  are  brought  to  bear  on 
the  visual  tracking  problem.  Using  these  techniques,  kinematic  models  are 
developed  for  the  hand  and  incorporated  into  a  real-time  tracking  system 
called  DigitEyes.  The  kinematic  model  plays  an  additional  role  in  predicting 
visibility  orders  for  tracking  self-occluding  motion.  The  resulting  tracking 
algorithms  were  tested  on  natural  hand  image  sequences  and  applied  to  a  3D 
mouse  user-interface  problem.  These  experiments  demonstrate  the  potential 
of  3D  visual  human  sensing. 


Future  Work 


•  Hand model calibration could be accomplished on-line by adapting
fixed model parameters in an estimation loop with a longer time
constant than state estimation. Model parameters would be initialized to
standard values, and after a period of adaptation would conform to any
user.

•  A real-time implementation of the self-occlusion handling tracker would
be extremely interesting. Such a system would allow unconstrained
hand motion during tracking for the first time. Such an approach
should be employed with multiple cameras to ensure accurate
estimation.

•  Ground truth 3D hand data should be obtained to measure the absolute
accuracy of the tracker as a function of the number of cameras. An
initial strategy is to attach LEDs to the palm and track the six DOF
palm motion using the fingers. This would avoid interference between
the two sensors and provide a base accuracy assessment.

•  Alternative  window  functions  should  be  investigated  and  their  effect 
on  tracker  performance  should  be  analyzed. 

•  It  would  be  interesting  to  combine  the  top-down  occlusion  prediction 
from  the  kinematic  model  with  a  bottom-up  occlusion  analysis  stage  in 
a  synergistic  approach. 

•  Applications of the DigitEyes sensor to graphics, puppetry, and user
interfaces should be developed to improve our understanding
of the necessary performance level for real applications.


Appendix  A 

Whole  Hand  DH  Model 


Table A.1 gives the full DH model for my right hand, which was used in
all of the experiments in this thesis.



Frame  Geometry         θ      d      a     α      Shape (in mm)           Next
-----  ---------------  -----  -----  ----  -----  ----------------------  ------------
0      Palm             0.0    0.0    0.0   0.0    x 56.0, y 86.0, z 15.0  1 8 15 22 29
1                       π/2    0.0    38.0  -π/2                           2
2                       0.0    -31.0  0.0   π/2                            3
3                       q7     0.0    0.0   π/2                            4
4      Finger 1 Link 0  q8     0.0    45.0  0.0    Rad 10.0                5
5      Finger 1 Link 1  q9     0.0    26.0  0.0    Rad 10.0                6
6      Finger 1 Link 2  q10    0.0    24.0  0.0    Rad 9.0                 7
7      Finger 1 Tip     0.0    0.0    0.0   0.0    Rad 9.0                 -
8                       π/2    0.0    37.0  -π/2                           9
9                       0.0    -9.0   0.0   π/2                            10
10                      q11    0.0    0.0   π/2                            11
11     Finger 2 Link 0  q12    0.0    56.0  0.0    Rad 10.0                12
12     Finger 2 Link 1  q13    0.0    27.0  0.0    Rad 10.0                13
13     Finger 2 Link 2  q14    0.0    22.0  0.0    Rad 9.0                 14
14     Finger 2 Tip     0.0    0.0    0.0   0.0    Rad 7.0                 -
15                      π/2    0.0    33.0  -π/2                           16
16                      0.0    6.0    0.0   π/2                            17
17                      q15    0.0    0.0   π/2                            18
18     Finger 3 Link 0  q16    0.0    53.0  0.0    Rad 9.0                 19
19     Finger 3 Link 1  q17    0.0    25.0  0.0    Rad 9.0                 20
20     Finger 3 Link 2  q18    0.0    20.0  0.0    Rad 8.0                 21
21     Finger 3 Tip     0.0    0.0    0.0   0.0    Rad 7.0                 -
22                      π/2    0.0    30.0  -π/2                           23
23                      0.0    26.0   0.0   π/2                            24
24                      q19    0.0    0.0   π/2                            25
25     Finger 4 Link 0  q20    0.0    38.0  0.0    Rad 9.0                 26
26     Finger 4 Link 1  q21    0.0    19.0  0.0    Rad 8.0                 27
27     Finger 4 Link 2  q22    0.0    17.0  0.0    Rad 7.0                 28
28     Finger 4 Tip     0.0    0.0    0.0   0.0    Rad 6.0                 -
29                      -π/2   15.0   43.0  -π/2                           30
30                      -π     38.0   0.0   0.0                            31
31                      q23    0.0    0.0   π/2                            32
32     Thumb Link 0     q24    0.0    46.0  -π/2   Rad 14.0                33
33                      q25    0.0    0.0   π/2                            34
34     Thumb Link 1     q25    0.0    34.0  0.0    Rad 10.0                35
35     Thumb Link 2     q26    0.0    25.0  0.0    Rad 10.0                36
36     Thumb Tip        0.0    0.0    0.0   0.0    Rad 8.0                 -

Table A.1: The Denavit-Hartenberg kinematic model for my right hand. It
was  calibrated  using  the  procedure  of  Sec.  2.2.3. 
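
As an illustration of how a table of this kind is used, the sketch below chains the three planar frames of Finger 1 (frames 4 to 6, Links 0 to 2) to obtain the fingertip position. It assumes the standard DH convention, T = Rot_z(θ) Trans_z(d) Trans_x(a) Rot_x(α). The convention used elsewhere in this thesis may differ, and the function names are hypothetical. Angles are in radians and lengths in mm, and the three joint angles stand in for the θ entries of frames 4 to 6.

```python
import math


def dh_transform(theta, d, a, alpha):
    """Homogeneous transform for one DH frame (standard convention, assumed)."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


# (d, a, alpha) for Finger 1 Links 0-2, taken from frames 4-6 of Table A.1 (mm).
FINGER1_LINKS = [(0.0, 45.0, 0.0), (0.0, 26.0, 0.0), (0.0, 24.0, 0.0)]


def finger1_tip(joint_angles):
    """Tip position expressed in the frame at the base of Finger 1 Link 0."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for theta, (d, a, alpha) in zip(joint_angles, FINGER1_LINKS):
        T = matmul(T, dh_transform(theta, d, a, alpha))
    return (T[0][3], T[1][3], T[2][3])
```

With all three joint angles at zero the chain is fully extended, so under the assumed convention the tip lies 45 + 26 + 24 = 95 mm along the x axis of the base frame. Prepending the transforms for frames 1 to 3 would express the same point relative to the palm.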